HPC Application optimization and code creation for PC,ARM,Windows,Linux & Android

Message boards : Number crunching : HPC Application optimization and code creation for PC,ARM,Windows,Linux & Android

Author	Message
QuantumEthos Joined: 11 Nov 17 Posts: 2 Credit: 488,640 RAC: 0
Message 5592 - Posted: 16 Dec 2017, 19:35:11 UTC

BOINC - enhancing research workloads for the benefit of mankind & humanity - Computer Optimisation - CPU, GPU & RAM - PC, Mac & ARM development

HPC - High Performance Computation for beneficial goals and obvious worth. (Guide, experimentation, developer kits and manuals) by (c) Rupert Summerskill

What would we do with a million cores!?
Save a planet from the grip of chaos death till the blood runs from shattered pores..
Dance in the rivers of time like the dolphins in the seas of evermore..
Flicker the candle of science till evening shores.

* Observing the workloads of many beneficial projects, we find that the workload data set is commonly small. The memory set is also often smaller or larger than a machine can compute optimally, and feature sets such as fae and AVX have commonly not been implemented. Some projects, like Asteroids@home and the SETI project, do use enhanced instruction sets such as AVX, and memory loads that benefit from the 4GB or more of RAM available on decent gaming and home laptops.

Not all modern machines have loads of RAM. However, research and university establishments use sufficiently powerful machines that can glow on the BOINC record in full glory with a 256MB to 768MB workload. These machines are commonly Opteron or Xeon based, and servers may have SPARC or PowerPC specific hardware and instruction sets.

In the examples below we can see that workloads include small data arrays, in the 40MB to 79MB range. In line with servers and gaming rigs we have 1GB of RAM per core; of course not all problems require a larger array in the workload, and some machines have 256MB per core!
However much RAM you allocate to the projected workload, small memory loads can and will be sufficient for data swapping and/or paging (like DNA replicators)... Some tasks can benefit substantially from larger thread and data models; to my mind DNA and mapping data are fine examples of specific workloads where memory counts. In addition, the thread count can be 4 or another number, and I suggest that a single task can use more than one core and instruction set (NEON for example, or symmetric threading of the FPU, SMT).

Specific workload optimisation, or rather generic optimisation with SSE, AVX, FPU threading and precision tuning, would be very cool while we deal with the workload-running app.

In particular the Ryzen multi-core is a new and exciting product, so take care to read the guides in the lower half of the document. AVX2, RDSEED, ADX and additional encryption formats are some of the most exciting changes to the AMD Ryzen arch.

AVX similarities to GPU cores: the function of AVX can be thought of as a CPU extension with the same kind of usage as a GPU! In short, combined with the FPU it is very much in the same performance category as the GPU cores, and of much worth to scientific research and to the development of game dynamics, sound, video and spaces in N-dimensional space.

CPU extensions can prepare vector space for the GPU to enhance speed, and optimise vector tables before GPU rendering and before placing sound in 3D for surround sound... Interpolate texture, sound and other data with bit swapping in SIMD instructions. The RND function can be used to explore additional data spaces. Encryption functions can be used to enhance unpredictable behaviour or to save space.

Further thought ... Efficiency: add a MHz/Dhrystone/MIPS performance-per-watt figure to each system ... then projects will further optimise workloads to improve workload energy & environmental efficiency versus work carried out.
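As a rough illustration of the performance-per-watt figure suggested above, here is a minimal Python sketch. The host names, MIPS figures and wattages are made-up placeholders, not measurements, and `mips_per_watt` and `rank_by_efficiency` are hypothetical helper names, not anything BOINC provides.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    mips: float   # measured integer throughput, e.g. Dhrystone MIPS (placeholder values)
    watts: float  # average power draw under load (placeholder values)


def mips_per_watt(host: Host) -> float:
    """Return throughput per watt; higher means more work per unit of energy."""
    if host.watts <= 0:
        raise ValueError("watts must be positive")
    return host.mips / host.watts


def rank_by_efficiency(hosts):
    """Sort hosts from most to least efficient."""
    return sorted(hosts, key=mips_per_watt, reverse=True)


# Hypothetical systems, for illustration only.
hosts = [
    Host("desktop", mips=88000.0, watts=95.0),
    Host("laptop", mips=30000.0, watts=25.0),
    Host("arm-board", mips=8000.0, watts=5.0),
]

for h in rank_by_efficiency(hosts):
    print(f"{h.name}: {mips_per_watt(h):.1f} MIPS/W")
```

With figures like these the small ARM board ranks first even though the desktop has the highest raw throughput, which is the point of recording efficiency next to speed.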
Work Hours x MHz / (efficiency per watt)
-------
Hours / % of projects finished with work completed

Also bear in mind that GPUs need watt efficiency and task management to optimise power used versus work done.... worker priority should always be:

efficiency + merit of the work
--------
time / % necessity

Please examine the issue further.

Rupert S

https://www.worldcommunitygrid.org
https://boinc.berkeley.edu/
http://www.charityengine.com/
http://esa-space.blogspot.com/

http://bit.ly/HPC-Dev - examination and findings, direction of HPC development
http://bit.ly/tRNG-Dev - will Random/Entropy drivers help - function examined and processed.

Work photos: HPC computing workload photos
http://bit.ly/HPCImpact
http://bit.ly/2HPCImpact
http://bit.ly/HPCCluster3
http://bit.ly/BoincStudies
http://bit.ly/ReserchPhotos

http://esa-space.blogspot.ru/2017/04/rng-and-random-web.html - we need Chaos Seeds: random seeds for our work
https://www.youtube.com/watch?v=mLQGXlxemlg - Optimizing HPC Service Delivery by a life time super computing tec
https://youtu.be/KbjFGQ9fHvw - Scaling and Optimizing Climate and Weather Forecasting Programs on Sunway TaihuLight - very exciting
https://insidehpc.com/2017/06/video-scaling-climate-weather-forecasting-sunway-taihulight/

HPC Best Practices..
http://www.intertwine-project.eu/best-practice-guides

AMD Platform Optimization - please read, for all developers
https://community.amd.com/thread/213045 - particular instruction differences for microcode optimisation
http://32ipi028l5q82yhj72224m8j.wpengine.netdna-cdn.com/wp-content/uploads/2017/03/GDC2017-Optimizing-For-AMD-Ryzen.pdf - code optimisation, a few very important lessons... may seem simple to some but obviously not to be taken for granted.
http://support.amd.com/TechDocs/24593.pdf - AMD64 Architecture Programmer’s Manual Volume 2: System Programming

CPU Optimisation - utility and function.
http://gpuopen.com/compute-product/codexl/ - CodeXL is a code efficiency analyser, optimiser and debugger for GPU, CPU and system.
https://github.com/GPUOpen-Tools/CodeXL/releases/latest
http://bit.ly/CoXLPhoto - CodeXL in action photos
http://www.guru3d.com/files-details/siv-4-45-download.html SIV system information viewer & setup
http://www.noamross.net/blog/2013/4/25/faster-talk.html - speeding up code, a guide - profiling and benchmarking.
http://www.pgroup.com/doc/pgi17ug-x64.pdf - PGI Compiler guide
http://www.agner.org/optimize/ - code optimisation for all programmers on x86, x86-64 and some others.. this is a terrific resource!
http://www.agner.org
https://github.com/ctuning/ck - data & program - testing and tuning

25/06/2017 11:36:51 | | OpenCL: AMD/ATI GPU 0: AMD Radeon R9 200 Series (driver version 2348.4, device version OpenCL 1.2 AMD-APP (2348.4), 3072MB, 3072MB available, 4178 GFLOPS peak)
25/06/2017 11:36:51 | | Host name: NKBlueCube
25/06/2017 11:36:51 | | Processor: 8 AuthenticAMD AMD FX-8320E Eight-Core Processor [Family 21 Model 2 Stepping 0]
25/06/2017 11:36:51 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 sse4_2 popcnt aes f16c syscall nx lm avx svm sse4a osvw ibs xop skinit wdt lwp fma4 tce tbm topx page1gb rdtscp bmi1

for example:
Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 sse4_2 popcnt aes f16c syscall nx lm avx sse4a osvw xop wdt fma4 topx page1gb rdtscp bmi1

or for example:
Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 sse4_2 popcnt aes f16c syscall nx lm avx svm sse4a osvw ibs xop skinit wdt lwp fma4 tce tbm topx page1gb rdtscp bmi1

for an improved upon instruction list in the newer boinc application..
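To see what actually differs between two feature lists like the ones above, a small Python sketch can split each "Processor features:" line into a set and compare them. The two strings are copied from the log lines above; `parse_features`, `simd_available` and the choice of `SIMD_FLAGS` are illustrative assumptions, not part of the BOINC client.

```python
# Flags of interest for vectorised number crunching (an illustrative selection).
SIMD_FLAGS = ("sse2", "sse4_1", "sse4_2", "avx", "avx2", "fma", "fma4", "xop")

FULL = (
    "25/06/2017 11:36:51 | | Processor features: fpu vme de pse tsc msr pae mce "
    "cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni "
    "ssse3 fma cx16 sse4_1 sse4_2 popcnt aes f16c syscall nx lm avx svm sse4a "
    "osvw ibs xop skinit wdt lwp fma4 tce tbm topx page1gb rdtscp bmi1"
)
REDUCED = (
    "Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge "
    "mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 "
    "sse4_2 popcnt aes f16c syscall nx lm avx sse4a osvw xop wdt fma4 topx "
    "page1gb rdtscp bmi1"
)


def parse_features(log_line: str) -> set:
    """Return the set of flags that follow 'Processor features:' (empty if absent)."""
    _, _, tail = log_line.partition("Processor features:")
    return set(tail.split())


def simd_available(features: set) -> list:
    """List the SIMD-related flags from SIMD_FLAGS that the host reports."""
    return [flag for flag in SIMD_FLAGS if flag in features]


missing = sorted(parse_features(FULL) - parse_features(REDUCED))
print("only in the longer list:", missing)
print("SIMD flags reported:", simd_available(parse_features(FULL)))
```

On these two lines the difference is the virtualisation and profiling flags (svm, ibs, skinit, lwp, tce, tbm); the SIMD flags are the same in both, and avx2 is not reported by this FX-8320E.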
(with appropriate configuration) 11000 MIPS & 2700 FPU MIPS - per core

An article that took some deep learning... itself ôo, anyway very interesting.... HIP C++ will, we think, be simpler than OpenCL as a higher-level code port... and CUDA code is machine-converted to 99.6%
http://www.anandtech.com/show/10831/amd-sc16-rocm-13-released-boltzmann-realized

Compilers and Make compliant with SMT and other HPC standards
https://cmake.org/
http://llvm.org/
http://llvm.org/docs/FAQ.html
https://gcc.gnu.org/
https://cygwin.com/index.html

Not free, obviously .. Intel
https://software.intel.com/en-us/articles/intel-advisor-roofline

Compilers with FORTRAN specifics and preferably C/C++ and HPC (C/C++ compatible with FORTRAN preferably)
https://gcc.gnu.org/wiki/HomePage
https://gcc.gnu.org/wiki/GFortranBinaries
https://software.intel.com/en-us/intel-parallel-studio-xe/try-buy/#parallelstudioxe
http://www.pgroup.com/products/pgiworkstationg.htm (limitations: nVidia-compatible GPU CUDA code & no obvious statement of OpenCL support)
http://llvm.org/ - LLVM, it seems, has Fortran compatibility..
(needs research)
http://llvm.org/docs/FAQ.html
http://www.pathscale.com/ - check it out

Fortran specialists (no C++ etcetera)
https://www.absoft.com/products/windows-fortran-compiler-suite/
http://www.fortran.com/products-page/compilers/fortrantools-for-windows/
https://www.cs.sfu.ca/~fedorova/Teaching/CMPT886/Spring2007/papers/adaptive-execution.pdf

IBM guidance*
http://www.prace-ri.eu/best-practice-guide-ibm-power-775-html/
https://www.redbooks.ibm.com/redbooks/pdfs/sg248280.pdf

**

PC/Mac/Windows/Linux/Android - high performance computation - the method and the means
https://www.khronos.org/news/events/2016-isc-high-performance
https://www.khronos.org/assets/uploads/developers/library/2008_siggraph_bof_opengl/OpenCL%20and%20OpenGL%20SIGGRAPH%20BOF%20Aug08.pdf

HPC Report
http://www.ziti.uni-heidelberg.de/ziti/uploads/ce_group/2017-ISC.pdf - Overview of MPI message characteristics of HPC server proxy applications. Interesting statistics, from which one can conclude that 64 to 256 core units is the range within which the maximum increase in message noise/entropic noise related to inter-process communication is observed.

https://www.microsoft.com/en-us/download/details.aspx?id=54507 Microsoft HPC Pack 2016 including Linux
https://technet.microsoft.com/en-us/library/cc514029(v=ws.11).aspx all HPC Packs 2016, 2012 to 2008, info and download
https://msdn.microsoft.com/en-us/library/ff976568.aspx Microsoft High Performance Computing for Developers - info and downloads
https://docs.microsoft.com/en-us/azure/virtual-machines/windows/hpcpack-cluster-active-directory - information and virtualisation

https://www.openfabrics.org/
https://centers.hpc.mil/users/tools.html
https://centers.hpc.mil/users/COSTQuickRef.html
https://centers.hpc.mil/software/
https://openhpc.community/downloads/
http://www.cray.com/blog/getting-new-intel-xeon-scalable-processors-hpc-workloads/ - details about Intel arch in HPC workloads.
OpenVX for high performance computing: multi-platform spec
"OpenVX for HPC Neural Nets and processing .... a new way to deliver on research, gaming & processing of data and images"
https://www.khronos.org/news/tags/tag/OpenVX
https://www.khronos.org/news/press/openvx-1.2-specification-cross-platform-acceleration-power-efficient-vision

OpenCL "GPU Development" links
https://www.khronos.org/blog/iwocl-where-you-learn-the-latest-on-opencl
https://www.khronos.org/opencl/
https://www.khronos.org/opencl/resources for SDK, learning & optimisation resources.
http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/
https://github.com/RadeonOpenCompute - ROCm: Platform for GPU Enabled HPC and UltraScale Computing
http://gpuopen.com/professional-compute/
http://gpuopen.com/compute-product/hcrng/
https://bitbucket.org/multicoreware/hcrng
http://gpuopen.com/compute-product/clrng/

Installing the AMD SDK improves compute performance. Optimise your code!
https://streamhpc.com/blog/2017-05-21/amd-open-sourced-rocms-opencl-driver-stack/
https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/blob/amd-master/README.md
http://developer.amd.com/tools-and-sdks/opencl-zone/
http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/
http://gpuopen.com/games-cgi/
http://developer.amd.com/tools-and-sdks/graphics-development/

http://hgpu.org information; interesting learning & source
http://dspace.princeton.edu/jspui/bitstream/88435/dsp01wm117r22g/1/Jia_princeton_0181D_11168.pdf Optimisation for parallel computing information.
https://arxiv.org/pdf/1705.05249 - CLBlast: A Tuned OpenCL BLAS Library demonstration.
https://indico.cern.ch/event/506317/contributions/2017945/attachments/1241758/1826458/SixTrackGPU.pdf
https://lhcathome.cern.ch/lhcathome/index.php - coders needed.
https://arxiv.org/pdf/1710.08616
https://arxiv.org/pdf/1710.08616.pdf - FORTRAN for GPU and multiprocessor usage in scientific research; also of interest for coding format, style, implementation & structure.

"The new implementation performs up to 4.9x faster when comparing one GPU to one multi-core CPU socket. On a full-scale production run with 1581 x 1301 x 58 grid size and 2km resolution, 24 Tesla P100 GPUs are shown to replace more than 50 18-core Broadwell Xeon sockets."

"GPUs are an attractive target architecture, with a memory bandwidth that is typically 5 to 7 times higher than Intel Xeon architectures of a similar generation."

"Compared to CPUs, GPUs support a very high number of parallel threads while having a very low thread switching overhead - however with the cost of small caches available per thread and a low single-threaded performance."

HIP - HSA - the CUDA Compatible C++ for Heterogeneous Computing
http://developer.amd.com/wordpress/media/2012/09/7637-HIP-Datasheet-V1_4-US-Letter.pdf
http://developer.amd.com/wordpress/media/2012/10/hsa10.pdf - a full guide
http://www.hsafoundation.com/
http://www.hsafoundation.com/hsa-developer-tools/
https://github.com/HSAFoundation/HSA-docs-AMD/wiki#initial-implementation
https://github.com/HSAFoundation/HSAIL-Tools
https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver - driver for kernel
http://www.amd.com/Documents/SDN-Whitepaper.pdf - Smart Software Defined Networks
http://support.amd.com/TechDocs/55766_SEV-KM%20API_Spec.pdf - Secure Encrypted Virtualization Key Management
http://support.amd.com/TechDocs/Protecting%20VM%20Register%20State%20with%20SEV-ES.pdf - Protecting VM Register State with SEV-ES
http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf - BIOS and kernel drivers

Machine Intelligence code optimization platforms
https://www.tensorflow.org/ - machine intelligence
https://github.com/tensorflow/tensorflow
https://github.com/hughperkins/tf-coriander - OpenCL TensorFlow

PyTorch - machine learning with graphs, Tensor philosophy and Python
https://github.com/pytorch/pytorch
http://pytorch.org

Hyperdash Python SDK - PyTorch
https://github.com/hyperdashio/hyperdash-sdk-py

Richard Herbert, real time learning with PyTorch - Real-time Machine Learning with PyTorch and Filestack
https://blog.filestack.com/tutorials/realtime-machine-learning-pytorch/

"Kirill Dubovikov - Knowledge distiller, Data Scientist and Software Architect"
https://medium.com/towards-data-science/pytorch-vs-tensorflow-spotting-the-difference-25c75777377b speed and data comparison
https://medium.com/@yaroslavvb/tensorflow-meets-pytorch-with-eager-mode-714cce161e6c

ARM Development software/SDKs & tools - HPC
https://developer.arm.com/products/software-development-tools
https://developer.arm.com/products/software-development-tools/hpc for high performance computing (ideal for BOINC)
https://developer.arm.com/products/software-development-tools/compilers for both HPC and app development.
https://developer.arm.com/products/system-design/fixed-virtual-platforms
https://www.synopsys.com/verification/virtual-prototyping/vdk/vdk-for-arm.html
https://www.synopsys.com/designware-ip/technical-bulletin/designware-hybrid-ip.html

ARM Feature Sets
https://www.arm.com/products/processors/instruction-set-architectures/index.php
https://www.arm.com/products/processors/armv8-architecture.php

IoT links - (internet of things)
https://www.infoq.com/articles/thread-protocol-for-home-automation
http://wso2.com/wso2_resources/wso2_whitepaper_a-reference-architecture-for-the-internet-of-things.pdf

Compiler optimization - process
https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
https://www.nextplatform.com/2017/05/25/nersc-supercomputing-site-eases-path-optimization-scale/
https://www-ssl.intel.com/content/www/us/en/events/hpcdevcon/parallel-programming-track.html#utilizing

Linux arch reference material
https://www.ibm.com/developerworks/library/l-linuxuniversal/

Agency GPL
https://code.nasa.gov/

Workers:
https://www.upwork.com/hire/driver-development-freelancers/

http://www.wcgsig.com/342585.gif

Update 2: for a comparison of GFlops/MIPS throughput of various BOINC tasks .. here we show the relevance of the code or function used ... AVX for example is multi-threaded! And so is the FPU pipeline of the AMD FX & Ryzen processors.....

http://bit.ly/HPCImpact (original non-edited photos ...) and set 2 (newer) http://bit.ly/2HPCImpact .... Some of our work with the updated graphics: http://bit.ly/ReserchPhotos

See the work throughput in GFlops compared to code efficiency per task! Sometimes entropy is needed to fulfil the task, one would imagine (for example on Android): http://bit.ly/tRNG-Dev

The improvement of the BOINC and World Community Grid projects has been observed and noted, and one feels it can be improved upon .. further improvement should be implemented as soon as possible, to improve work versus output efficiency.
Thank you kindly, programmers/workers & scientists, for your perseverance & effort.

RS

**

Update 3 Q & A:

"In reference to the use of VirtualBox, there is a new product by Berkeley > http://singularity.lbl.gov/ called Singularity that handles repeatable-condition containers... and has low overhead for the virtualisation data-set. As to the particle spread, one should possibly consider the multiple-core and threaded-core model specific to the Ryzen and Intel sets... One could imagine that the multi-threaded nature of ARM server cores, combined with the nature of multi-threaded and headed ARM CPUs and GPU run-script environments, is a new and uncompromising land of opportunity and challenge. Many of the instructions on the FMV4 and Vector instruction sets have multi-threaded en-action at lower precision..."

http://fife.fnal.gov/singularity-on-the-osg/

RS

----

Eric McIntosh (CERN; project administrator, developer, tester and scientist):

"Well we are far from trying to optimise GPU code. First let me explain that we have a tracking loop over turns (up to 1,000,000, hoping for 10,000,000 soon) which contains a large number of inner loops over particles, currently up to 64. Luckily these loops over particles can be parallelised, as each particle is totally independent. In addition the original author F. Schmidt pre-calculated everything possible before entering the tracking loop. Each turn involves some 10,000 steps over a varying number of inner loops, e.g. straight section, quadrupole, beam-beam interaction, power supply ripple, etc. etc., of which there are about 50 different possibilities. A straight section is really just a multiply and add, whereas beam-beam involves hundreds or more FLOPs. The first idea would be to use a much larger number of particles to best utilise the GPU.
This however would produce a large amount of I/O and use a lot of disk space, but that is maybe not insurmountable. However all the code is FORTRAN, the outer loop calls subroutines (could inline), and has many tests/branches. It would be great if the main loop fitted entirely into the GPU and we would have rare Host access for I/O or BOINC checkpoint and progress calls, or when one or more particles are lost. My colleague Ricardo is actively looking at redoing it in C, which would also allow much more portability and also allow it to run in parallel on multi-core systems. For the moment we just run tasks in parallel, which works rather well (apart from some current infrastructure problems). I hope to come up with some numbers next week on GPU testing.

The code itself has been regularly measured and optimised; for example we re-ordered array indices to optimise memory access and rewrote the Error Function of a Complex Number to be faster but with adequate precision. Portability does come at a price but ensures accuracy of results. I shall publish measurements in an upcoming paper. I am sure we gain much more from being portable and being able to use almost any IEEE 754 compliant processor.

On the issue of SixTrack and/or experiments, this will shortly be under discussion at CERN I am sure. Currently SixTrack has many more Hosts/volunteers, is simple to install, and has been around for 13 years. Not everyone loves VMbox. Not a big deal at present as we rarely have enough SixTrack work to keep all volunteers busy. I hope to re-address all this in some weeks, after current BOINC infrastructure issues are resolved and we have the new "super" SixTrack with much broader application, e.g. collimation studies, and we support a much wider range of platforms (MacOS, ARM) and use features such as AVX.

Eric."

Update 4 : Virtualisation

QEMU is obviously of use on many projects because of machine emulation and virtualisation.. It comes in flavours including Windows, Mac and Linux.
http://www.qemu.org/
https://www.vmware.com/try-vmware.html - free products at the bottom
https://www.vmware.com/go/downloadplayer
https://www.vmware.com/go/get-free-esxi

* Docker Server & Docker CE (community edition), and this comes with a server edition! (QEMU Based Containers)

So what do the projects & system.. feel and sense around the subject of using Docker CE? Obviously the professional version could be used for support of the main project, and the CE edition or pro for the user..

https://store.docker.com/editions/community/docker-ce-desktop-windows
https://store.docker.com/search?offering=community&q=&type=edition
https://www.ctl.io/developers/blog/post/what-is-docker-and-when-to-use-it/
https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-getting-started
https://www.howtoforge.com/tutorial/how-to-use-docker-introduction/

** How to convert VMs and use Hyper-V and Docker
https://www.virtualbox.org/manual/ch10.html - compatibility
https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/quick-start/enable-hyper-v
https://www.groovypost.com/howto/migrate-virtual-box-vms-windows-10-hyper-v/
https://hyperv.veeam.com/blog/nested-vitualization-hyperv/
https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/user-guide/nested-virtualization
https://superuser.com/questions/1144405/enable-virtualization-for-windows-10-pro-running-inside-virtualbox

Update 5 : IO Bottlenecks and solutions.
http://blog.scoutapp.com/articles/2011/02/10/understanding-disk-i-o-when-should-you-be-worried
http://www.violin-memory.com/blog/understanding-io-random-vs-sequential/

Drive cache: even 128MB of cache does wonders for #DataScience #storage; we use a 2GB http://www.romexsoftware.com/en-us/primo-cache/index.html #Cache to the #Drive at 300MB/s

http://bit.ly/BoincStudies - Result Studies
https://browser.geekbench.com/v4/compute/743093 GPU Function
https://browser.geekbench.com/v4/cpu/2831836 CPU Function
http://www.anandtech.com/show/11523/qnap-launches-ts1277-nas-with-amd-ryzen-cpus
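For anyone who wants to poke at the random versus sequential question from the links above on their own drive, here is a minimal Python sketch that reads the same blocks of a scratch file in order and then in shuffled order. The block size, file size and the `read_blocks` / `compare` names are arbitrary choices for illustration. Note that a 1 MiB file written a moment ago will almost certainly be served from the operating system's cache, so the timings only become meaningful with a file much larger than RAM or with caches dropped.

```python
import os
import random
import tempfile
import time

BLOCK = 4096  # bytes per read; an arbitrary illustrative size


def read_blocks(path, offsets):
    """Read one BLOCK at each offset and return the total number of bytes read."""
    total = 0
    with open(path, "rb", buffering=0) as f:  # unbuffered at the Python level
        for off in offsets:
            f.seek(off)
            total += len(f.read(BLOCK))
    return total


def compare(path, seed=0):
    """Time the same set of block reads in sequential and in shuffled order."""
    size = os.path.getsize(path)
    sequential = list(range(0, size, BLOCK))
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)
    results = {}
    for label, offsets in (("sequential", sequential), ("random", shuffled)):
        start = time.perf_counter()
        nbytes = read_blocks(path, offsets)
        results[label] = (nbytes, time.perf_counter() - start)
    return results


# Create a 1 MiB scratch file, measure, then clean up.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(BLOCK * 256))
    scratch = tmp.name
try:
    results = compare(scratch)
    for label, (nbytes, seconds) in results.items():
        print(f"{label}: {nbytes} bytes in {seconds * 1000:.2f} ms")
finally:
    os.remove(scratch)
```

Both passes read exactly the same bytes, so any timing gap comes from access order alone; a drive cache of the kind mentioned above narrows that gap.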
