Monday, June 3, 2019
Novel Clockwise Task Migration in Many-Core Chip
A Novel Clockwise Task Migration in Many-Core Chip Multiprocessors

Abstract- The industry trend for Chip Multiprocessors (CMPs) is moving from multi-core to many-core designs to obtain higher computing performance, flexibility, and scalability. Moreover, transistor sizes are constantly shrinking, and more and more transistors are integrated on a single chip, which makes it possible to design more powerful and complex systems. However, obtaining higher computing performance increases power consumption, which in turn increases on-chip hotspots and the overall chip temperature. A high peak temperature causes performance degradation, reduces reliability, shortens the chip lifespan, and can eventually damage the system. Therefore, Runtime Thermal Management (RTM) for CMPs has become crucial to minimize temperature without any performance degradation. In this paper, a new clockwise task migration technique is proposed for many-core CMPs. The proposed technique migrates the heavily loaded tasks placed in the central cores away from the center to the surrounding cores, performing clockwise task migrations to spread out the hotspots that form in the central cores of the chip. Moreover, instead of using thermal sensors, the proposed migration algorithm estimates core temperatures from performance counters and the proposed equations, which yields efficient results. Simulation results show up to a 15% reduction in the maximum temperature of the whole many-core CMP.
The effectiveness of the proposed technique is shown by the temperatures of the many-core CMP remaining below the maximum temperature limit.

Keywords- chip multiprocessors, many-core, task migration, performance counter, runtime thermal management.

I. INTRODUCTION

Chip multiprocessors (CMPs) continue to increase their transistor counts to meet the growing demand for reliable, high-performance computing. At the same time, transistor sizes are constantly shrinking, and more and more transistors are integrated on a single chip, which allows more powerful and complex CMP architectures to be designed [1]. These advantages increase the number of cores per chip. CMPs are therefore shifting from the multi-core to the many-core era, in which tens or hundreds of cores are integrated on a single chip and connected via a network-on-chip (NoC) [4-5]. Many-core CMPs provide higher computing performance because they execute heavily loaded tasks, which consume more power. However, heavily loaded tasks raise the overall chip temperature and create on-chip hotspots. Hotspots are the main barrier to the wide adoption of many-core CMP architectures. They lead to performance degradation, reduced reliability, increased cooling costs, shorter chip lifespan, and eventually system failure. Therefore, efficient Runtime Thermal Management (RTM) techniques have become essential for achieving better computing performance with higher scalability while maintaining reliability [3,6-8]. RTM not only balances and distributes the chip temperature but also enables many-core CMPs to operate at favorable performance while staying below a temperature threshold [1-2]. Therefore, to maintain efficient performance on many-core CMPs, the authors propose a clockwise task migration technique as an alternative way to control the temperature of the cores.
The proposed migration technique migrates the heavily loaded tasks placed in the central cores to the surrounding cores of the core layer. In other words, the proposed method performs clockwise task migrations to spread out the hotspots that form in the central cores of the chip. The proposed method aims to maximize throughput on many-core CMPs while satisfying the peak temperature constraint [5-6,9].

As many-core CMPs grow, using expensive, high-overhead thermal sensors to measure core temperatures is neither effective nor appropriate for addressing thermal challenges [3,12]. Therefore, this work provides a new technique that estimates core temperatures without thermal sensors. The proposed migration algorithm obtains core temperature estimates from the performance counters placed in each core. In this way, hot cores are spread across the chip without any performance degradation [1-3,11-13]. The contributions of this paper are as follows:
- It develops a novel runtime task migration technique for many-core systems to balance hotspots.
- Instead of using expensive, high-overhead sensors to measure core temperatures, the proposed task migration technique uses performance counters.
- Experimental results show that the proposed algorithm can significantly outperform the conventional approach.

The rest of the paper is organized as follows. Section II reviews related work. Section III introduces the proposed technique. Section IV presents the experimental evaluation. Finally, Section V concludes the paper.

II. RELATED WORK

The industry trend for CMPs is to keep increasing the number of transistors exponentially, following Moore's law. This helps achieve higher computing performance by executing heavily loaded tasks [1-3].
However, heavily loaded tasks increase on-chip thermal hotspots and the overall CMP peak temperature. Thus, when hundreds of processors are integrated on a single chip, as in many-core CMPs, offline methods are not efficient. Therefore, RTM becomes crucial to balance on-chip thermal hotspots and the overall CMP peak temperature [1-3,8-10]. To this end, many works have addressed the dissipation and elimination of thermal hotspots through different techniques. For instance, the Dynamic Voltage and Frequency Scaling (DVFS) technique in [7] controls temperature by dynamically adjusting the processor speed based on the workload. However, because DVFS lowers the processor speed, it sacrifices performance to cool down the chip. Another technique, task migration, manages on-chip temperature by balancing task loads among CMP tiles without slowing down processing. In [1-3,8,10-11], the proposed algorithms are in some cases unable to find a proper destination core because of thermal constraints. The authors therefore fall back on DVFS, which has proved inefficient in terms of performance. In [2], the authors implemented several thermal-aware algorithms that migrate tasks between processor cores to reduce thermal variation in a 3D architecture with stacked DRAM. However, some of these techniques perform static task migration, which in some cases can move a task from a cold core to a hotspot core. The authors also proposed other techniques that rely on expensive, high-overhead thermal sensors to detect on-chip hotspots. Moreover, in [2-3], the authors proposed techniques that always assign a new job to the coolest core to balance thermal hotspots across the chip, but this rapidly increases the number of hotspots in the system.
Therefore, when hundreds of processors are integrated on a single chip, as in many-core CMPs, offline methods are not efficient at distributing and balancing thermal hotspots. In this work, a novel runtime task migration technique is proposed that offers an effective solution to the thermal challenges of many-core CMPs. Furthermore, instead of using expensive, high-overhead sensors to measure core temperatures, the proposed migration technique uses performance counters to estimate the temperature of the many-core CMP tiles.

Fig. 1 A 64-core many-core CMP and the TCU connection with a tile.
Fig. 2 The components of a tile in the 64-core many-core CMP.

III. PROPOSED TECHNIQUE

The CMP industry trend is moving from multi-core to many-core architectures to achieve better computing performance while maintaining reliability. Many-core CMP architectures therefore run heavily loaded tasks so that the system operates at high computing performance. However, heavy tasks increase the chip peak temperature and the number of on-chip hotspots. Thus, RTM is crucial to keep the system temperature balanced and below its threshold while executing tasks efficiently.

Figure 1 presents a many-core CMP with 64 tiles. Each tile includes a core, a private L1 cache bank, and a shared L2 cache bank, as shown in Figure 2. The proposed technique aims to balance the thermal distribution to combat thermal issues and temperature-related reliability problems. It performs task migration between cores at runtime, repeated periodically at a predefined time interval. In this work, each interval is 100 ms. At the end of each interval, the instructions per cycle (IPC) of each core are used to calculate its power consumption, so IPC is a critical factor in the power calculation.
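The per-interval procedure described in this section has four steps. Each core's IPC is read from its performance counter, power is estimated with Equation 1, the floor plan is split into a central part and four surrounding regions, and tasks are exchanged clockwise. These steps can be sketched in Python as below.

This is a minimal illustration under stated assumptions, not the authors' implementation. The following are all hypothetical:
- the counter record `CounterSample` and the helper names;
- the pinwheel shape of the four surrounding regions;
- the skip-if-no-benefit guard;
- the C_L and V_DD values.

The 16-tile central part follows from the text: the surrounding part has three times as many cores, so 64 = 16 + 48. The sketch takes the central part to be the middle 4×4 block. Equation 1 is assumed to have the usual dynamic-power form P = IPC · f · C_L · V_DD². The sketch ranks cores by estimated power directly and omits the th1 to th4 classification of Figure 5, because the text does not say which quantity those thresholds are compared with.

```python
"""Hypothetical sketch of one TCU interval of the clockwise task migration."""
from dataclasses import dataclass
from typing import Dict, List, Tuple

Core = Tuple[int, int]          # (row, col) of a tile in the 8x8 mesh
MESH = 8                        # 64 tiles in an 8x8 mesh (Table 1)
CLOCK_HZ = 3_000_000_000        # 3 GHz cores (Table 1); fixed, no DVFS
INTERVAL_MS = 100               # migration interval used in this work


@dataclass
class CounterSample:
    """Hypothetical per-core counter deltas over one interval."""
    instructions: int
    cycles: int


def ipc(sample: CounterSample) -> float:
    """Instructions per cycle measured over the interval."""
    if sample.cycles <= 0:
        raise ValueError("cycle counter did not advance")
    return sample.instructions / sample.cycles


def core_power(ipc_value: float, f_hz: float, c_load: float, vdd: float) -> float:
    """Equation 1 (assumed form): P = IPC * f * C_L * V_DD^2."""
    return ipc_value * f_hz * c_load * vdd ** 2


def region_of(row: int, col: int) -> int:
    """0 = central 4x4 part; 1..4 = surrounding regions in clockwise order.

    ASSUMPTION: the 48 surrounding tiles form a pinwheel of four 12-tile
    regions (top, right, bottom, left); the shapes in Figure 4 may differ.
    """
    if 2 <= row <= 5 and 2 <= col <= 5:
        return 0
    if row <= 1 and col <= 5:
        return 1  # top
    if row <= 5 and col >= 6:
        return 2  # right
    if row >= 6 and col >= 2:
        return 3  # bottom
    return 4      # left


def fill_lookup_table(samples: Dict[Core, CounterSample],
                      c_load: float, vdd: float) -> Dict[Core, float]:
    """TCU lookup table: estimated power of every core for this interval."""
    return {core: core_power(ipc(s), CLOCK_HZ, c_load, vdd)
            for core, s in samples.items()}


def clockwise_migrate(table: Dict[Core, float]) -> List[Tuple[Core, Core]]:
    """Swap the hottest central task with the coldest task of region 1, then of
    regions 2, 3 and 4. `table` is updated in place so each value follows its
    task; returns the (central core, surrounding core) pairs that were swapped."""
    central = [c for c in table if region_of(*c) == 0]
    migrations = []
    for region in (1, 2, 3, 4):                       # clockwise order
        ring = [c for c in table if region_of(*c) == region]
        hot = max(central, key=table.__getitem__)     # hottest remaining central core
        cold = min(ring, key=table.__getitem__)       # coldest core in this region
        if table[hot] <= table[cold]:                 # ASSUMPTION: skip useless swaps
            continue
        table[hot], table[cold] = table[cold], table[hot]
        migrations.append((hot, cold))
    return migrations


if __name__ == "__main__":
    cycles = CLOCK_HZ * INTERVAL_MS // 1000           # 300,000,000 cycles per interval
    # Hypothetical IPC values: four busy central cores, one lightly loaded core per region.
    ipc_of = {(r, c): 0.5 for r in range(MESH) for c in range(MESH)}
    ipc_of.update({(2, 2): 3.0, (3, 3): 2.5, (4, 4): 2.0, (5, 5): 1.5,
                   (0, 0): 0.1, (0, 7): 0.2, (7, 7): 0.3, (7, 0): 0.4})
    samples = {core: CounterSample(int(v * cycles), cycles) for core, v in ipc_of.items()}
    table = fill_lookup_table(samples, c_load=1e-9, vdd=1.0)  # placeholder C_L and V_DD
    for hot, cold in clockwise_migrate(table):
        print(f"migrate task on {hot} -> {cold}")
```

Because f, C_L and V_DD are the same for every core (no DVFS), ranking cores by estimated power is equivalent to ranking them by IPC. The absolute power values matter only when they must be compared against a limit such as the power budget.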
Note that cores with higher power consumption execute tasks at higher performance and therefore reach higher temperatures than cores with lower power consumption [8]. The power consumption of each core is calculated with Equation 1:

$$P = \mathit{IPC} \cdot f \cdot C_L \cdot V_{DD}^{2} \qquad (1)$$

where P is the core power consumption, IPC is the number of instructions per cycle (the core activity), f is the core frequency, C_L is the load capacitance, and V_DD is the supply voltage. The frequency of every core in the many-core CMP is constant. DVFS is expensive and unsuitable because of its performance degradation, so the frequency of each core is not changed dynamically. As Equation 1 shows, IPC plays a key role in calculating and predicting the power consumption of each core. IPC is measured with performance counters, which are widely available in modern processors, and each core has a performance counter for counting IPC. At the end of each time interval, the IPC of each core is read from its performance counter, and its power consumption is then calculated with Equation 1. The calculated power values fill a lookup table in the Thermal Control Unit (TCU). An example lookup table is illustrated in Figure 3. In the target many-core system, the TCU is assumed to be placed close to all of the cores, as shown in Figure 1. Based on the table in the TCU, the many-core floor plan is divided into two parts: the central part, with one region, and the surrounding part, with four regions, as shown in Figure 4. The technique then balances the temperature of the system based on the thermal distribution of the central and surrounding parts. Using the lookup table in Figure 3, hot and cold cores are identified from each core's activity using the thresholds shown in Figure 5, where th1 = 5, th2 = 10, th3 = 15, and th4 = 20.

Fig. 3 A sample lookup table in the TCU, used at the end of each time interval.
Fig. 4 The central part and the surrounding part of the 64-tile many-core CMP.

Based on the map of hot and cold cores, the proposed technique sorts the cores in both the central part and the surrounding part from hottest to coldest. It then exchanges the hottest core in the central part with the coldest core in the surrounding part. As a result, heavily loaded tasks migrate toward the edges of the chip and lightly loaded tasks migrate toward the central part. The edges of the chip are a better location for hot cores than the central part, because neighboring cores strongly affect each other's temperature and an edge core has fewer neighbors than a central core. The surrounding part has three times as many cores as the central part, so each hot core in the central part has several cold cores to exchange with.

At the end of each time interval, each core sends its IPC (core activity), obtained from its performance counter, to the TCU. The TCU then uses the lookup table to build two sets of activities, one for the central part and one for the surrounding part, and sorts each set separately from hottest to coldest. Next, the TCU (Figure 1) exchanges the hottest core in the central part with the coldest core in the surrounding part, region by region, as explained in the next subsection. In this way, the TCU swaps the hot cores in the central part with the cold cores in the surrounding part in a clockwise manner.

Fig. 5 The thresholds used for determining the temperature ranges of the cores.
Fig. 6 The proposed clockwise task migration algorithm.

A. Clockwise Migration Algorithm

A novel clockwise algorithm is proposed so that the hot cores are distributed over all regions of the surrounding part rather than gathered in one region. The algorithm divides the surrounding part into four regions, as shown in Figure 4. The TCU first sorts the cores from highest to lowest temperature in both the central part and the surrounding part. The clockwise algorithm then exchanges the hottest core in the central part with the coldest core in region one of the surrounding part. Next, it exchanges the hottest remaining core in the central part with the coldest core in region two, and so on. The system repeats this procedure at the end of each time interval, swapping the hot cores in the central part with the cold cores in the four surrounding regions. Phase 1 and Phase 2 of the proposed clockwise task migration technique are summarized in Figure 6.

IV. EXPERIMENTAL EVALUATION

A 64-tile many-core CMP architecture, shown in Figure 1, runs multithreaded workloads to evaluate the proposed clockwise task migration technique.

a) Platform Setup

To validate the efficiency of the many-core CMP architecture, the authors use traffic traces extracted from the gem5 [15] full-system simulator to set up the base system platform. The area of the cores and cache banks is estimated with CACTI [21] and McPAT [20]. Multithreaded applications from the PARSEC benchmark suite [14] are used in the experimental evaluation. The detailed system configuration is given in Table 1. For these benchmarks, one billion instructions are executed with the simlarge input set, starting from the Region of Interest (ROI). HotSpot [17] version 5.0 is employed as a grid-based thermal modeling tool for chip temperature estimation.
For the experimental evaluation, the maximum temperature limit Tmax and the dark-silicon peak power budget Pbudget are assumed to be 80 °C and 100 W, respectively.

Table 1. Specification of the target CMP architecture.
Component | Description
Number of cores | 64, 8×8 mesh
Core configuration | Alpha21164, 3 GHz, 65 nm
Private cache per core | SRAM, 4-way, 32 line, 32 KB per core
On-chip memory |
Baseline | Static random mapping
Proposed | Proposed migration technique

b) Experimental Results

In this subsection, the many-core CMP is evaluated in two cases: without any migration policy (Baseline) and with the proposed clockwise migration policy (Proposed).

Figure 7 shows the normalized throughput for PARSEC and SPEC workloads, where throughput is the number of executed instructions per second (IPS). As shown in Figure 7, the Proposed architecture yields an average throughput improvement of 31% compared with the Baseline. Figure 8 shows the normalized energy consumption for PARSEC and SPEC workloads. The Proposed architecture yields an average improvement of 69% in energy consumption compared with the Baseline. In addition, Figure 9 (a) and (b) show the temperature distribution of canneal from the PARSEC workloads for the Baseline and Proposed architectures, respectively. As shown in Figure 9, after applying the proposed clockwise task migration technique (Proposed), all cores of the many-core CMP stay below the maximum temperature of 80 °C. The Baseline, in contrast, spends up to 19% of the time above the maximum temperature, during which hotspots appear. In other words, the proposed clockwise task migration technique distributes the temperature of the many-core CMP without creating hotspots.

Fig. 7. Comparison results of IPC.
Fig. 8. Comparison results of energy consumption.

V. CONCLUSION

Many-core CMPs provide higher system performance, flexibility, and scalability. These advantages come with increased power consumption, so peak temperature becomes a concern. Thus, Runtime Thermal Management (RTM) of many-core CMPs is crucial for minimizing thermal hotspots without performance degradation. In this paper, the proposed clockwise task migration technique migrates heavily loaded tasks from the central cores to the surrounding cores. Cores with higher power consumption execute tasks at higher performance and therefore reach higher temperatures. The system therefore estimates core temperatures using the performance counters placed in each core instead of thermal sensors. Experimental results on the 64-tile many-core CMP show significant improvements in average normalized throughput and energy consumption. Compared with the Baseline without migration, the proposed technique yields an average throughput improvement of 31% and an average energy consumption improvement of 69%. Furthermore, the results show up to a 15% reduction in peak temperature, with all tiles staying below the 80 °C maximum temperature limit on the 64-tile many-core CMP.

Fig. 9. Comparison results of temperature: (a) and (b).