For example, the NVIDIA® GeForce RTX™ comes with Tensor Cores, specialized cores designed for matrix math and mixed-precision computing. These cores can perform many multiply-accumulate operations in a single clock cycle, making the graphics cards capable of very fast data processing and well-positioned for GPU-accelerated work.
Announced in March 2022, the upcoming H100 will feature fourth-generation Tensor Cores with extended support for FP8 precision formats, which NVIDIA claims will speed up large language models “by an incredible 30X over the previous generation.”
Turing's Tensor Cores enable mixed-precision training, accelerating GPU performance by up to 32x compared to Pascal GPUs. Turing GPUs also feature Ray Tracing cores, which enhance visual effects such as lighting, shadows, and reflections in 3D environments.
While CUDA cores focus on more traditional computational tasks across various industries like gaming, scientific research, and video editing, tensor cores cater specifically to AI-related workloads such as image recognition, natural language processing, and even autonomous driving.
NVIDIA RTX GPUs — capable of running a broad range of applications at the highest performance — unlock the full potential of generative AI on PCs. Tensor Cores in these GPUs dramatically speed AI performance across the most demanding applications for work and play.
DLSS uses the power of NVIDIA's supercomputers to train and regularly improve its AI model. The latest models are delivered to your GeForce RTX PC through Game Ready Drivers. Tensor Cores then use their teraFLOPS of dedicated AI horsepower to run the DLSS AI network in real time.
CUDA cores, with their parallel processing capabilities, play a significant role in these AI workloads. Machine learning algorithms, particularly deep learning algorithms, involve performing a large number of matrix multiplications, and a GPU can spread those multiply-add operations across thousands of cores at once.
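As a rough illustration, here is a minimal PyTorch sketch of the kind of matrix multiplication that dominates deep learning workloads; it assumes a CUDA-capable GPU is available, and the tensor shapes are purely hypothetical:

```python
import torch

# A fully connected layer is essentially one matrix multiplication:
# (batch, in_features) @ (in_features, out_features) -> (batch, out_features).
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(256, 1024, device=device)   # a batch of 256 input vectors (hypothetical sizes)
w = torch.randn(1024, 4096, device=device)  # weight matrix of an imaginary layer

y = x @ w   # millions of multiply-adds, executed in parallel across the GPU's cores
print(y.shape, y.device)
```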
The RTX 3080 features 272 Tensor Cores, 96 ROPs, and 68 RT cores for hardware-accelerated ray tracing. The double-width graphics card has a 320W TDP and a new 12-pin power connector.
The first generation of Tensor Cores debuted with the Volta GPU microarchitecture. These cores perform mixed-precision multiply-accumulate operations on the FP16 number format, accumulating results in FP32. With the V100 GPU's 640 Tensor Cores, the first generation could provide up to 5x higher performance than the earlier Pascal-series GPUs.
NVIDIA's CUDA cores and AMD's stream processors are designed differently and do different amounts of work per core per clock, so their counts can't be compared one-to-one. Even though an AMD card may have more stream processors, a GPU with fewer CUDA cores can still deliver better performance.
PyTorch uses Tensor Cores on Volta GPUs as long as your inputs are in FP16 and the dimensions of your GEMMs/convolutions satisfy the conditions for Tensor Core use (basically, GEMM dimensions are multiples of 8, or, for convolutions, the batch size and the number of input and output channels are multiples of 8).
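To make that rule of thumb concrete, here is a small sketch of an FP16 matrix multiply that should be eligible for Tensor Core execution on a Volta-or-newer GPU; the shapes are assumptions chosen so every GEMM dimension is a multiple of 8:

```python
import torch

# Illustrative shapes only: every GEMM dimension (512, 1024, 2048) is a multiple
# of 8 and both operands are FP16, matching the rule of thumb quoted above.
device = "cuda"  # assumes a Tensor Core capable GPU (Volta or newer)

a = torch.randn(512, 1024, device=device, dtype=torch.float16)
b = torch.randn(1024, 2048, device=device, dtype=torch.float16)

c = a @ b   # dispatched to a half-precision GEMM that can run on Tensor Cores
print(c.dtype, c.shape)  # torch.float16 torch.Size([512, 2048])
```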
In contrast to CUDA Cores, Tensor Cores are highly specialized units optimized for specific mathematical operations. While CUDA Cores offer flexibility for a wide range of parallel computing tasks, Tensor Cores provide unparalleled performance for matrix operations common in AI and HPC workloads.
Tensor Cores enable mixed-precision computing, dynamically adapting calculations to accelerate throughput while preserving accuracy. The latest generation expands these speedups to a full range of workloads.
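Here is a minimal sketch of what mixed-precision training looks like in practice using PyTorch's automatic mixed precision; the toy model, optimizer, and data are hypothetical and only meant to show the pattern:

```python
import torch
from torch import nn

device = "cuda"  # assumes a Tensor Core capable GPU
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # scales the loss so FP16 gradients don't underflow

inputs = torch.randn(64, 1024, device=device)          # hypothetical batch
targets = torch.randint(0, 10, (64,), device=device)   # hypothetical labels

for step in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # matmuls run in FP16 on Tensor Cores
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

The autocast context is what does the dynamic adapting: it picks FP16 for operations that benefit from Tensor Cores and keeps numerically sensitive operations in FP32.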
In addition, the RTX 3090 features 328 Tensor Cores, 112 ROPs, and 82 RT cores. Not only is this the most powerful NVIDIA graphics card yet, it's the only one in the new GeForce RTX 30 series to support NVLink.
The 4060 typically delivers around 20% more performance than the 3060. Of course, we've also noted that the RTX 4060 really feels more like a replacement for the RTX 3050, considering both offer 8GB of VRAM on a 128-bit interface.
In terms of raw performance, the RTX 3080 has the edge, but it's a close race, and the 3080 is missing the next-gen tech that the 4070 offers. DLSS 3 is amazing. Seriously: the frame generation it packs means the 4070 was able to basically wreck the 3080 in most supported games compared to the previous-gen DLSS 2.0.
Offering impressive performance for gaming, simulation, AI modeling, rendering, and graphical applications, the NVIDIA GeForce RTX 4080 GPU is ideal for intensive workloads. It has 16GB of GDDR6X memory with 76 RT cores, 304 Tensor Cores, and 9,728 CUDA cores.
CPUs also have a larger cache and memory capacity than GPUs, which can be beneficial for some AI applications. On the other hand, GPUs are better suited for handling large datasets and complex AI models. They are optimized for parallel processing, which allows them to perform many calculations simultaneously.
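A quick, unscientific way to see that difference is to time the same matrix multiply on the CPU and on the GPU. The sketch below assumes PyTorch and a CUDA-capable card, and the matrix size is arbitrary:

```python
import time
import torch

# Rough, illustrative comparison: absolute numbers will vary widely between machines.
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

t0 = time.perf_counter()
_ = a @ b
cpu_s = time.perf_counter() - t0

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    _ = a_gpu @ b_gpu            # warm-up: triggers CUDA initialization
    torch.cuda.synchronize()     # GPU work is asynchronous; sync before timing
    t0 = time.perf_counter()
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize()
    gpu_s = time.perf_counter() - t0
    print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s")
else:
    print(f"CPU: {cpu_s:.3f}s (no CUDA device found)")
```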
CUDA provides a programming model and a set of APIs that enable developers to write code that runs directly on the GPU, unlocking the potential for significant performance gains compared to traditional CPU-based computing.
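For instance, here is a minimal sketch of a GPU kernel written from Python using Numba's CUDA support; Numba and the vector_add kernel are illustrative choices, not something the text above prescribes:

```python
import numpy as np
from numba import cuda

# A trivial element-wise addition written as a CUDA kernel.
@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # this thread's global index
    if i < out.size:
        out[i] = a[i] + b[i]  # one element per GPU thread

n = 1_000_000
a = np.arange(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
out = np.empty_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)  # Numba copies the arrays to and from the device
print(out[:4])  # [1. 2. 3. 4.]
```

The [blocks, threads_per_block] launch configuration mirrors CUDA's grid-of-thread-blocks execution model, the same model used when writing kernels directly in CUDA C/C++.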
NVIDIA's parallel computing architecture, known as CUDA, allows for significant boosts in computing performance by utilizing the GPU's ability to accelerate the most time-consuming operations you execute on your PC. In fact, because so many of them work in parallel, NVIDIA CUDA cores significantly help PC gaming graphics.
Plus, rumors say the RTX 5090 will feature 192 RT cores and 768 Tensor Cores, up from 128 and 512 respectively in the RTX 4090. The RTX 5090 is anticipated to deliver 50-70% performance improvements across the board compared to the RTX 4090 – particularly excelling at 4K resolution.
The RTX 5090 is said to have a 600-watt spec, although, as VideoCardz points out, it's not clear if this refers to how much the entire GPU board draws or how much power the chip itself consumes. Either way, it looks like the RTX 5090 will draw 150 watts more than the 450 watts that the RTX 4090 pulls.
You want at least 32GB of DDR5 for your RTX 4090. If you're running an AMD processor that supports EXPO profiles, make sure you pick up a kit with the correct profile timings. And if you're looking for longevity, 64GB of DDR5 is a pretty good place to start.