Monday, October 12, 2009

GPU: GT300 GPU on Fermi technology

The situation: NVIDIA's chip releases
- 2006 - G80
- 2008 - GT200
- 2009 - GT300
---
The GT300 is built on the Fermi architecture; so far it is the only chip on this architecture.
Its specifications:
* 3,000,000,000 transistors
* 512 CUDA processing cores that are organized into 16 streaming multiprocessors of 32 cores each
* The memory system can support up to 6GB of memory
* The GPU also supports the IEEE 754-2008 standard (floating-point arithmetic)
---
CUDA is an NVIDIA buzzword; it seems they slap it on just about everything they make.

Here is the 20-page article Looking Beyond Graphics. Excerpts:

* Originally, NVIDIA used the term Compute Unified Device Architecture (CUDA) to describe its GPU computing platform. Later, CUDA became a blanket term for NVIDIA’s GPU architecture, run-time platform, and software-development tools. Now NVIDIA is raising the bar.

* NVIDIA’s next-generation CUDA architecture, code-named Fermi.

* Fermi is a significant advance over previous GPU architectures. NVIDIA’s first architecture expressly modified for GPU computing was the G80, introduced in November 2006 with the GeForce 8800. The G80 departed from the traditional approach of using dedicated shaders, separate vertex/pixel pipelines, manually managed vector registers, and other features that were efficient for graphics but unwieldy for general-purpose computing. The G80 was also NVIDIA’s first GPU architecture programmable in C — an essential feature for compute applications.

* G80-class GPUs had 128 programmable shaders, later called streaming processor cores and now known simply as CUDA cores. Eight CUDA cores were grouped into a “streaming multiprocessor” whose cores shared common resources, such as local memory, register files, load/store units, and thread schedulers. G80-class chips had 16 of these streaming multiprocessors. (16 SMs x 8 CUDA cores per SM = 128 cores per chip.) With time slicing and fast thread switching, a streaming multiprocessor can run thousands of parallel threads on these cores.

* In June 2008, NVIDIA introduced the GT200 architecture. The GT200 retained eight CUDA cores per streaming multiprocessor but bumped the number of those multiprocessors to 30, for a total of 240 CUDA cores per chip. GT200 chips are sold as the GeForce GTX280 (for consumer PCs), Quadro FX5800 (for workstations), and Tesla T10 (for high-performance computing). These were also the first NVIDIA GPUs to support double-precision floating-point operations — unnecessary for 3D graphics but vital for many scientific and engineering programs.

* Fermi supersedes the GT200 architecture. It has 32 CUDA cores per streaming multiprocessor — four times as many as the GT200 and G80. Initially, Fermi GPUs will have 16 streaming multiprocessors, for a total of 512 CUDA cores per chip.

* In a Fermi GPU, the dual-issue pipelines are decoupled and independent. They cannot issue two instructions per cycle from the same thread. Instead, they issue two instructions per cycle from different warps. Each streaming multiprocessor can manage 48 warps. Because each warp has 32 threads, a streaming multiprocessor can manage 1,536 threads. With 16 streaming multiprocessors, a Fermi-class GPU can handle 24,576 parallel threads.

* Of course, there aren’t enough CUDA cores to execute instructions from every thread on every clock cycle. Each core executes only one instruction per cycle, so instructions from “only” 512 threads can be executing at a given moment. Switching among threads is instantaneous, so instructions from 512 different threads can execute on the next clock cycle, and so on. This massively parallel threading is the key to CUDA’s high throughput.

* Figure 5 is the highest-level view of the Fermi architecture. All 16 streaming multiprocessors — each with 32 CUDA cores — share a 768KB unified L2 cache.

* A 64-bit floating-point operation is now exactly half as fast as a 32-bit floating-point operation.

* Specifically, a Fermi-class streaming multiprocessor can execute 16 double-precision fused multiply-add (FMA) instructions per clock cycle.

* With 16 streaming multiprocessors per chip, a Fermi GPU can execute 512 billion double-precision floating-point operations per gigahertz.

* Currently, the streaming multiprocessor in a GeForce GTX280 chip runs at 1.29GHz, limited by an older 65nm fabrication process. Manufactured at 40nm, a Fermi-based GPU should easily exceed that clock frequency. (NVIDIA has not yet disclosed the clock rates for any Fermi GPUs.) At 1.5GHz (our conservative estimate), the GPU would deliver 768 double-precision gigaflops — and twice as many single-precision gigaflops. (Fermi’s enhanced FMA instruction supports 32-bit operations, unlike previous generations.) At 2.0GHz, double-precision performance would reach the magic teraflops threshold, once the province of multimillion-dollar room-size supercomputers.

* Comparison of NVIDIA’s three CUDA-capable GPU architectures. The G80, introduced in 2006, opened the door to CUDA and was NVIDIA’s first architecture modified for GPU computing. The GT200 architecture followed in 2008. Although it improved performance, it retained several limitations of the G80. The new Fermi architecture overcomes most of those limitations and adds features not seen before in GPUs, such as error-correction codes (ECC).

* Additionally, the Fermi ISA improves the efficiency of “atomic” integer instructions. These are read-modify-write instructions that read a value in memory, modify the value by performing an arithmetic or logical operation, then quickly write the result back to the same memory location. Atomic instructions must be uninterruptible. Otherwise, an incomplete sequence of operations would leave the data in an unpredictable state, with possibly serious consequences for the program.

* NVIDIA GPUs have included atomic instructions since 2007, but they were relatively slow. Fermi makes them 5 to 20 times faster. For the sake of efficiency, atomic instructions are handled by special integer units attached to the L2 cache controller, not by the integer units in the CUDA cores. Executing atomic instructions in the cores would introduce too much latency as the data is shuttled across the chip.

* Efficient handling of atomic instructions is important for software developers. Two emerging standards for parallel programming have embraced the concept of atomic operations: OpenCL and DirectCompute. OpenCL is an open industry standard managed by the OpenCL Compute Working Group, which is part of an industry consortium called the Khronos Group. DirectCompute is an application programming interface (API) from Microsoft and was formerly known as Compute Shader or DirectX 11 Compute. DirectCompute is part of the larger DirectX 10 and DirectX 11 APIs in Windows Vista and Windows 7. Developers can use OpenCL or DirectCompute instead of (or in addition to) NVIDIA’s CUDA tools.

* The Fermi architecture makes the GPU programmable in object-oriented C++, not just in procedural C. Moreover, Fermi supports the entire C++ language, not a subset. Programmers can use virtual functions, try/catch statements (which handle exceptions), and make system calls via standard headers such as stdio.h (the standard input/output library). NVIDIA provides a CUDA C/C++ compiler, but any third-party tool vendor can target the CUDA architecture.

* Unlike most other compilers, CUDA compilers don’t translate source code directly into native machine code. Instead, they target a low-level virtual machine and Parallel Thread eXecution (PTX) instruction set. The PTX virtual machine is invisible to users and delivered as part of the GPU’s graphics driver. Few people even know it’s there, but it’s very important to CUDA software developers. PTX provides a machine-independent layer that insulates high-level software from the details of the GPU architecture.

* When users install a CUDA-enabled program, the driver automatically translates PTX instructions into machine instructions for the specific NVIDIA GPU in the system. It doesn’t matter whether the GPU is based on the G80, GT200, or Fermi architecture — the installed program will be optimized for that GPU. In this way, CUDA programs run natively (full speed) but are delivered in a machine-independent format that’s ready for installation on any CUDA-capable system. (NVIDIA says more than 100 million PCs have CUDA-capable GPUs.)

* Besides bringing C++ to CUDA, the Fermi architecture with PTX 2.0 makes it easier to use other high-level programming languages and compilers. Fermi supports FORTRAN — a vital language for scientific and engineering applications — as well as Java (via native methods) and Python. Many financial programs are written in Java, and Python is popular for rapid application development. PTX 2.0 also adds features for OpenCL and DirectCompute, such as new bit-reverse and append instructions.

* Perhaps the most exciting prospects are the applications yet to be discovered. Every leap in computing power leads to new applications that were impractical before. Unlocking the power of parallel processing has been a dream of computer scientists for decades. The advent of affordable chip-level multiprocessing may be the catalyst that makes the dream come true. If only one lesson can be drawn from the history of computing, it’s that the world has an insatiable appetite for computing power. Someone will always find a way to use it.
