For example, if you join two tables through Spark SQL, Spark's cost-based optimizer (CBO) may decide to broadcast the smaller table (or the smaller DataFrame) to the executors to make the join run faster. Start with --executor-cores 2 and double --executor-memory (because --executor-cores also tells how many tasks one executor will run concurrently), and see what it does for you. Your environment is compact in terms of available memory, so going to 3 or 4 cores per executor will give you even better memory utilization.
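The doubling advice follows from simple arithmetic: --executor-cores is also the number of concurrent tasks, so the worst-case heap share per task stays constant when both settings are doubled. A minimal sketch (the sizes are illustrative, not taken from this question):

```python
def per_task_heap_mb(executor_memory_mb, executor_cores):
    """Rough worst-case heap share per concurrent task: --executor-cores
    also sets how many tasks one executor runs at the same time, so the
    executor heap is split among that many tasks (shared regions aside)."""
    return executor_memory_mb / executor_cores

# --executor-cores 1 with --executor-memory 1g
print(per_task_heap_mb(1024, 1))  # 1024.0
# --executor-cores 2 with --executor-memory doubled to 2g: the per-task
# share is unchanged, while JVM overhead is no longer paid per task slot
print(per_task_heap_mb(2048, 2))  # 1024.0
```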

Tags: apache-spark, hadoop, memory, memory-management, out-of-memory

If I run the program with the same driver memory but a higher executor memory, the job runs longer (about 3-4 minutes) than the first case, and then it encounters a different error from earlier: a container requesting/using more memory than allowed, which is killed because of that. Are the two errors linked in some way?

The job will also run successfully with driver memory 2g and executor memory 1g if I increase the driver memory overhead (1g) and the executor memory overhead (1g).

And don't go above 5 cores per executor.

If the job is based purely on transformations and terminates on some distributed output action like rdd.saveAsTextFile or rdd.saveToCassandra, then the memory needs of the driver will be very low. If you have an RDD of 3GB in the cluster and call val myresultArray = rdd.collect, then you will need 3GB of memory in the driver to hold that data, plus some extra room for the functions mentioned in the first paragraph. If the job requires the driver to participate in the computation, e.g.
some ML algorithm that needs to materialize results and broadcast them on the next iteration, then your job becomes dependent on the amount of data passing through the driver. The memory you need to assign to the driver depends on the job.

Second case (same driver memory, higher executor memory):
/bin/spark-submit --class --master yarn-cluster --driver-memory 7g --executor-memory 3g --num-executors 3 --executor-cores 1 --jars .

However, I am confused and do not understand completely why this happens, and would appreciate it if someone could provide me with some guidance and explanation. From the formula, a driver memory of 11g gives MEMORY_TOTAL = 11g + 1.154g = 12.154g, while the explicit overhead settings give a MEMORY_TOTAL of only 2.524g. Why is a different error thrown, and why does the job run longer, in the second case compared with the first, when only the executor memory is increased? Thanks in advance.

For simple development, I executed my Python code in standalone cluster mode (8 workers, 20 cores, 45.3 G memory) with spark-submit.

Spark shell required memory = (Driver Memory + 384 MB) + (Number of executors * (Executor memory + 384 MB)).
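The required-memory formula above can be written out as a short calculation (a sketch only; the flat 384 MB per JVM is the overhead figure quoted in this thread, and real defaults vary by Spark version):

```python
def spark_shell_required_mb(driver_memory_mb, executor_memory_mb, num_executors):
    """Required memory = (driver memory + 384 MB)
    + num_executors * (executor memory + 384 MB), per the formula above."""
    overhead_mb = 384
    return (driver_memory_mb + overhead_mb) \
        + num_executors * (executor_memory_mb + overhead_mb)

# e.g. a 7g driver with three 1g executors
print(spark_shell_required_mb(7 * 1024, 1024, 3))  # 11776 (MB), about 11.5g
```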

Is there something else internal at work here that I am missing?

First case: if I run my program with any driver memory less than 11g, I will get the error below, which is the SparkContext being stopped (or a similar error, a method being called on a stopped SparkContext). Any setting with driver memory greater than 10g will lead to the job being able to run successfully.

Third case:
/bin/spark-submit --class --master yarn-cluster --driver-memory 11g --executor-memory 1g --num-executors 3 --executor-cores 1 --jars .

From our experience and from the Spark developers' recommendation, it is best practice to go above 1 core per executor. So I think a few GBs will just be OK for your driver. In a Spark application, the driver is responsible for task scheduling and the executors are responsible for executing the concrete tasks in your job. Even if you don't use Spark shared variables explicitly, Spark very likely creates them internally anyway. Another benefit is that Spark's shared variables (accumulators and broadcast variables) will have just one copy per executor, not per task, so switching to multiple tasks per executor is a direct memory saving right there.

It seems that just by increasing the memory overhead by a small amount of 1024 (1g), the job runs successfully with a driver memory of only 2g, and MEMORY_TOTAL is only 2g + (driverMemory * 0.07, with minimum of 384m) = 2g + 0.524g = 2.524g! Whereas without the overhead configuration, driver memory less than 11g fails, but that doesn't make sense from the formula, which is why I am confused.
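For what it's worth, the MEMORY_TOTAL figures quoted in the question (12.154g and 2.524g) are reproduced exactly if the overhead is computed as 384m plus 7% of the requested memory, rather than max(7%, 384m) as the rule is usually worded. A small sketch of that arithmetic (an observation about the question's numbers, not an official Spark formula):

```python
def memory_total_g(requested_g):
    """MEMORY_TOTAL = requested memory + overhead, where the overhead
    that reproduces the question's figures is 0.384g (384m) plus 7% of
    the requested memory. Actual Spark/YARN defaults differ by version,
    so treat this as illustrative arithmetic only."""
    return requested_g + 0.384 + 0.07 * requested_g

print(round(memory_total_g(11), 3))  # 12.154 (the 11g driver case)
print(round(memory_total_g(2), 3))   # 2.524 (the 2g driver case)
```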
I will provide some background information, post my questions, and describe the cases I have experienced below.

Environment: Memory 20G, 20 VCores per node (3 nodes in total).

First case:
/bin/spark-submit --class --master yarn-cluster --driver-memory 7g --executor-memory 1g --num-executors 3 --executor-cores 1 --jars .

From what I have gathered, this is related to memory not being enough. Both the third and fourth cases succeed, and I understand that it is because I am giving more memory, which solves the memory problems. Although I find it weird that, with the executor memory increased, this error occurs instead of the error from the first case.

Now I would like to set executor memory or driver memory for performance tuning. How should I deal with executor memory and driver memory in Spark?

When you reduce the partitioning to 1, that single partition will be in one of the executors. PS: I can't find now the reference where it was recommended to go above 1 core per executor.

References:
http://spark.apache.org/docs/latest/programming-guide.html#shared-variables
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
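The one-copy-per-executor behavior of broadcast variables (see the shared-variables link above) is easy to quantify. A rough illustration with made-up sizes, holding the number of concurrent tasks fixed at six:

```python
def broadcast_footprint_mb(broadcast_mb, num_executors):
    """A broadcast variable is materialized once per executor, so its
    cluster-wide footprint scales with the executor count, not the
    task count."""
    return broadcast_mb * num_executors

# 100 MB broadcast, 6 concurrent tasks either way:
print(broadcast_footprint_mb(100, 6))  # 600 MB with 6 one-core executors
print(broadcast_footprint_mb(100, 3))  # 300 MB with 3 two-core executors
```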

Operations like .collect, .take and takeSample deliver data to the driver, and hence the driver needs enough memory to hold such data. The driver is also responsible for delivering files and collecting metrics, but it is not involved in data processing. If the data you collect is too large for the driver, I'd advise you to find an alternative solution.

I am confused about dealing with executor memory and driver memory in Spark.

But the idea is that running multiple tasks in the same executor gives you the ability to share some common memory regions, so it actually saves memory. From the Spark documentation, the definition of executor memory is:


Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).

If you are familiar with MapReduce, your map tasks and reduce tasks are all executed in executors (in Spark they are called ShuffleMapTasks and ResultTasks); also, whatever RDD you want to cache lives in the executor's JVM heap and on its disk.

My code recursively filters an RDD to make it smaller (removing examples as part of an algorithm), then does mapToPair and collect to gather the results and save them within a list.

Fourth case:
/bin/spark-submit --class --master yarn-cluster --driver-memory 2g --executor-memory 1g --conf spark.yarn.executor.memoryOverhead=1024 --conf spark.yarn.driver.memoryOverhead=1024 --num-executors 3 --executor-cores 1 --jars .

So, from the formula, I can see that my job requires a MEMORY_TOTAL of around 12.154g to run successfully, which explains why I need more than 10g for the driver memory setting. Why does increasing the memory overhead (for both driver and executor) allow my job to complete successfully with a lower MEMORY_TOTAL (12.154g vs 2.524g)? Any help will be appreciated and would really help with my understanding of Spark.

spark.driver.memory + spark.yarn.driver.memoryOverhead is the memory with which YARN will create a JVM; for the 11g case that is 11g + (driverMemory * 0.07, with a minimum of 384m). Here 384 MB is the minimum memory (overhead) value that may be utilized by Spark when executing jobs. For a driver that only coordinates transformations, a few hundred MB will do. As you don't know which one, each one of your executors will need to have >> 20GB. We use Spark 1.5 and stopped using --executor-cores 1 quite some time ago, as it was giving GC problems; it also looks like a Spark bug, because just giving more memory wasn't helping as much as switching to having more tasks per container.
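The container-killed error from the second case follows from YARN's bookkeeping: each container is allocated heap plus overhead, and YARN kills the container when its observed physical memory use exceeds that allocation. A hedged sketch using the 7%/384m rule quoted above (actual defaults depend on the Spark version):

```python
def container_size_mb(heap_mb):
    """YARN container allocation = requested heap + memory overhead,
    taken here as max(7% of the heap, 384 MB) per the rule above."""
    return heap_mb + max(int(heap_mb * 0.07), 384)

def yarn_kills(used_mb, heap_mb):
    """YARN kills a container whose observed physical memory use
    exceeds its allocated size."""
    return used_mb > container_size_mb(heap_mb)

# a 3g executor heap gets a ~3.4g container...
print(container_size_mb(3 * 1024))  # 3456
# ...so using 3.5g of physical memory gets the container killed
print(yarn_kills(3584, 3 * 1024))   # True
```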
I am doing some memory tuning on my Spark job on YARN, and I notice that different settings give different results and affect the outcome of the Spark job run.

However, in the third case, spark.driver.memory + spark.yarn.driver.memoryOverhead is the memory that YARN grants the driver (see http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ for how the overhead is calculated).

I guess tasks in the same executor may peak in memory consumption at different times, so you don't waste memory and don't have to overprovision it just to make things work.