The big data explosion: how can computing overcome the memory bottleneck?

Over the past few decades, improvements in computing performance have come from processing larger amounts of data, faster and more accurately.

Memory and storage are now measured in gigabytes and terabytes, not kilobytes and megabytes. Processors operate on 64-bit rather than 8-bit data blocks. Yet the industry's ability to create and collect high-quality data is growing faster than its ability to analyze it.

On the one hand, the Internet and the Internet of Things are driving a data explosion. Hewlett Packard Labs research scientist John Paul Strachan pointed out in a talk at the Leti Devices Workshop (a side event at December's IEEE International Electron Devices Meeting, IEDM) that Facebook users alone generate 4 petabytes (1 petabyte = 10^15 bytes) of data per day.

Digital capture of the real world, through sensors, cameras, and other devices, is even more prolific. A single car can collect 4 terabytes of data per day, and a big city may have millions of such cars in the future. The energy and bandwidth required to capture all this information and upload it to a central data center are staggering.

Neural networks and the von Neumann bottleneck

At the same time, much of the analysis of these large data sets falls to neural networks.

A neural network works by computing sums of matrix products. A matrix of input data is loaded into an array, and each element is multiplied by a predetermined weight. In most cases, the results are passed to the next layer of the network and multiplied by a new set of weights. After several such steps, the network can draw a conclusion about what the data represents: perhaps a cat, a suspicious pattern of behavior, or a particular kind of electrical activity.
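The layer-by-layer multiply-and-sum computation described above can be sketched in a few lines of Python. This is a toy illustration; the weights, inputs, and layer sizes are invented for the example:

```python
def dense_layer(inputs, weights):
    """One layer: multiply each input by its weight and sum, per output neuron."""
    return [sum(x * w for x, w in zip(inputs, row)) for row in weights]

def relu(values):
    """A simple nonlinearity applied between layers."""
    return [max(0.0, v) for v in values]

# Two-layer toy network: 3 inputs -> 2 hidden units -> 1 output
x = [1.0, 2.0, 3.0]
w1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]   # hidden-layer weights
w2 = [[1.0, -1.0]]                           # output-layer weights

hidden = relu(dense_layer(x, w1))
output = dense_layer(hidden, w2)
```

A real network repeats this pattern over millions of weights; the arithmetic itself is no more complicated than this.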

During the training phase, the network's conclusions are compared with previously known "correct" answers. Then a process called backpropagation uses the difference between the predicted and correct values to adjust each weight in each layer of the network up or down.
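For a single weight, that update reduces to nudging the weight against the error gradient. Here is a minimal sketch for one linear neuron; the learning rate and numbers are illustrative assumptions, not from any particular network:

```python
def train_step(w, x, target, lr=0.1):
    """One gradient-descent update for a single linear neuron y = w * x."""
    y = w * x              # forward pass
    error = y - target     # difference from the known "correct" answer
    grad = error * x       # gradient of squared error (1/2 * error^2) w.r.t. w
    return w - lr * grad   # adjust the weight up or down

w = 0.0
for _ in range(50):        # repeated passes shrink the error
    w = train_step(w, x=2.0, target=1.0)
# w converges toward 0.5, since 0.5 * 2.0 reproduces the target 1.0
```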

Conceptually, this method is very simple. In practice, however, the data sets are enormous and the number of calculation steps is vast. The best performer on the ImageNet image classification benchmark used an eight-layer neural network with 60 million parameters, and a single pass of one image through the algorithm takes 20 billion operations.

For each layer of the network, the existing weights and the elements of each training example are loaded into the processor's registers, multiplied, and the results written back to memory. The performance bottleneck is not the computation but the bandwidth between the processor and the memory array. This separation of memory and processor is one of the defining features of the von Neumann architecture and exists in almost all modern computing systems.

Big data sets, bandwidth-constrained machine learning workloads, and the end of Dennard scaling are shifting the industry's benchmark from raw computing performance to computational efficiency: what is the best balance between silicon area, power consumption, and computational accuracy for a given task?

Low precision, analog memory, high efficiency

Jeff Welser, vice president and lab director of IBM Research Almaden, pointed out in an IEDM presentation that neural network computing usually does not require high computational precision. A 16-bit computing block uses one-quarter of the circuit area of an equivalent 32-bit block and halves the amount of data that must be moved. Even with conventional architectures, reduced-precision algorithms can significantly increase computational efficiency.
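The memory-traffic half of that trade-off is easy to demonstrate with Python's standard `struct` module, which can pack values as 32-bit (`f`) or 16-bit (`e`) floats. The values here are arbitrary; the point is the byte counts and the small rounding loss:

```python
import struct

values = [0.1, -2.5, 3.75, 100.0]

fp32 = struct.pack(f'{len(values)}f', *values)  # 4 bytes per value
fp16 = struct.pack(f'{len(values)}e', *values)  # 2 bytes per value

# Half the bytes means half the data moved between memory and processor
assert len(fp16) * 2 == len(fp32)

# Lower precision is lossy: a round trip through fp16 perturbs values slightly,
# which neural network workloads typically tolerate
roundtrip = struct.unpack(f'{len(values)}e', fp16)
```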

The need to overcome the memory bottleneck is also driving more radical in-memory computing architectures. In the simplest view of such an architecture, predetermined weights are stored in an array of non-volatile memory elements. Input data is applied to the memory word lines, and the currents from the individual cells are summed.
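In effect, the array computes a matrix-vector product using Ohm's and Kirchhoff's laws: each cell passes a current I = G x V, and the bit line sums the currents. A software model of that idealized behavior, with made-up conductances and voltages:

```python
# Stored weights as cell conductances G[i][j], inputs as word-line voltages V[j]
G = [[0.2, 0.5, 0.1],
     [0.4, 0.0, 0.3]]
V = [1.0, 0.5, 2.0]

# Each bit line i collects the sum of per-cell currents I = G * V,
# i.e. one row of a matrix-vector multiply, computed "in" the memory array
bitline_currents = [sum(g * v for g, v in zip(row, V)) for row in G]
```

The entire multiply-accumulate happens in one analog step per bit line, with no weight ever shuttled to a processor register.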

How to implement such a solution in hardware is the subject of ongoing research. The industry has proposed digital and analog solutions.

For example, a digital array can be assembled from flash memory components. Researchers at the University of Minnesota demonstrated a CMOS-compatible eFlash memory cell that stores charge on a floating gate between the control gate and the channel. In such an array, specific weight values and their rate of change (the learning rate) can be precisely controlled through careful integrated circuit design. This approach is attractive because it relies on mature, well-understood component technology.

However, much of the data of interest in machine learning applications is inherently analog. Xin Zheng, a researcher at Stanford University and the University of California, Berkeley, and colleagues observed that analog-to-digital and digital-to-analog conversions, with their associated power and silicon footprint, can be avoided by using storage elements such as RRAM that inherently store analog values. However, currently available analog memory components present a new set of challenges.

Where a digital component is simply on or off, an analog component can take a range of values, and the value stored for a given signal depends on the properties of the device. In a filamentary RRAM, resistance drops as conductive filaments form between the device's terminals. A series of weak programming pulses produces weak filaments, while strong pulses produce stronger ones. The pulse strength and number required to store a given value therefore depend on the kinetics of filament formation, and the learning rate depends on the separation between resistance states and the number of pulses required to move from one state to the next.
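This state-dependent programming can be caricatured with a saturating update model: each SET pulse raises the conductance by an amount that shrinks as the filament strengthens. This is a toy model with invented parameters, not measured device data:

```python
def set_pulse(g, g_max=1.0, step=0.2):
    """One SET pulse: conductance rises, but by less as the filament saturates."""
    return g + step * (g_max - g)

g = 0.0
history = []
for _ in range(5):
    g = set_pulse(g)
    history.append(g)
# Successive increments shrink, so the number of pulses needed to reach a
# target conductance depends on the device's current state
```

In such a device, programming a specific weight means accounting for where on this nonlinear curve the cell currently sits.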

For inference tasks, conventional CMOS logic can be used to calculate the weights, which are then stored in the RRAM array. The exact value achieved with a given number of programming pulses may vary from device to device, but simulations show that overall accuracy is robust in the face of these variations.
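That robustness can be illustrated by perturbing the stored weights of a toy classifier with per-device "programming" noise and checking that its decisions survive. The classifier, samples, and 5% noise level are all invented for the illustration (with a seeded random generator):

```python
import random

random.seed(0)

# Toy linear classifier: the sign of the weighted sum decides the class
weights = [0.8, -0.5, 0.3]
samples = [[1.0, 0.2, 0.5], [0.1, 1.0, 0.0], [0.5, 0.5, 1.0]]

def classify(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

clean = [classify(weights, x) for x in samples]

# Emulate device-to-device variation: each stored weight is off by up to +/-5%
noisy_weights = [w * (1 + random.uniform(-0.05, 0.05)) for w in weights]
noisy = [classify(noisy_weights, x) for x in samples]

# Because the decision margins are larger than the perturbation,
# the classifications are unchanged: noisy == clean
```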

For learning tasks, however, individual weights must be adjusted both up and down as corrections propagate back through the network. Unfortunately, current RRAM devices typically respond asymmetrically to SET and RESET pulses: simply changing the sign of the programming pulse does not produce an equal adjustment in the opposite direction. This asymmetry is a major obstacle to implementing learning tasks in memory.
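A toy model makes the problem concrete: suppose SET pulses produce small, gradual increases while RESET pulses produce large, abrupt drops. Applying a +1 update followed by a -1 update then fails to return the weight to its starting point. The step sizes below are invented for illustration:

```python
def update(g, sign, g_max=1.0, set_step=0.1, reset_step=0.5):
    """Asymmetric device: gradual SET (sign=+1), abrupt RESET (sign=-1)."""
    if sign > 0:
        return min(g_max, g + set_step * (g_max - g))  # small state-dependent rise
    return max(0.0, g - reset_step * g)                # large abrupt drop

g0 = 0.5
g_up_down = update(update(g0, +1), -1)   # one SET, then one RESET
# g_up_down is far below the original 0.5: "equal and opposite"
# pulses are not equal, so gradient-style weight updates accumulate error
```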

Endurance, stability and repeatability

As Meiran Zhao, a graduate student at Tsinghua University, explained, learning tasks also require large amounts of data and many weight updates, on the order of 10^5 to 10^7. Testing of RRAM arrays designed for conventional storage applications puts device lifetimes in the same range. However, data storage applications require digital values (whether a device is on or off) and typically use SET and RESET pulses strong enough to create or remove a robust conductive filament. With weaker pulses, Zhao's team showed, analog switching survived more than 10^11 update pulses without failure, although learning accuracy did degrade beyond about 10^9 update pulses.

The large number of training cycles also threatens the stability of stored weight values. In an RRAM device, the conductivity of the filament is determined by the oxygen vacancy concentration within the filament volume, which is in turn controlled by the applied voltage pulses. However, the positions of individual oxygen vacancies cannot be precisely controlled. As they migrate within the device, whether under the influence of the voltage gradient or through thermal excitation, the exact resistance changes.

Another type of non-volatile memory, electrochemical RAM (ECRAM), attempts to address the limitations of filamentary RRAM. Where RRAM is a two-terminal device, ECRAM has three terminals. A voltage applied to the third terminal controls the insertion of ions from a LiPON electrolyte layer into a WO3 conductor. The resistance depends on this redox reaction, which can be controlled precisely and reproducibly in both the increasing and decreasing directions.

Beyond neural networks

Convolutional neural networks are the most common machine learning technique, but they are not necessarily the best. The nonlinear, probabilistic behavior of emerging memory devices is a challenge for some algorithms but may be an advantage for others.

For example, a generative adversarial network uses one neural network to generate test examples for another. It succeeds when the "discriminator" network is able to distinguish real data from the examples produced by the "generator" network.

Thus, the discriminator network can learn to recognize photos of puppies by being shown a set of non-puppy images created by the generator network. One challenge for generative adversarial algorithms is producing test examples that cover the full space of interest. "Mode collapse", in which the generated examples cluster around a limited number of categories, may be reduced by the inherent randomness of an RRAM network. The same nonlinear behavior that makes weights difficult to store accurately may lead to more varied test cases.

RRAM behavior is history-dependent: the probability that a given RESET pulse actually resets the device decreases as the number of preceding SET pulses increases. A group at Imec used this behavior as the basis for a time-series learning rule, in which devices that are active at time t are used to predict the devices that will be active at time t + Δ. The prediction is compared with the actual data; devices that predicted correctly are reinforced with a SET pulse, while devices that predicted incorrectly are weakened with a RESET pulse. After training, the resulting network topology serves as a model for generating new data sequences.
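The reinforce-correct, weaken-incorrect rule can be sketched in software. This is a loose simplification inspired by the description above, not the Imec implementation; the step size and activity patterns are invented:

```python
def train_step(weights, active_t, active_next, step=0.1):
    """Strengthen devices whose activity at t correctly predicted t + delta."""
    updated = []
    for w, was_active, will_be_active in zip(weights, active_t, active_next):
        if was_active:
            if will_be_active:
                w = min(1.0, w + step)   # correct prediction: SET pulse
            else:
                w = max(0.0, w - step)   # incorrect prediction: RESET pulse
        updated.append(w)                # inactive devices are untouched
    return updated

weights = [0.5, 0.5, 0.5]
# Device 0: active at t and at t + delta (correct). Device 1: active at t
# but not at t + delta (incorrect). Device 2: inactive at t (no update).
weights = train_step(weights, active_t=[1, 1, 0], active_next=[1, 0, 1])
```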

Finally, researchers at the University of Michigan used an RRAM crossbar array in conjunction with stochastic conductive-bridge memory (CBRAM) devices to solve "spin glass" optimization problems by simulated annealing. The spin glass problem comes from physics but applies in many other fields: it seeks the lowest-energy state of a random two-dimensional array of interacting spins. Simulated annealing randomly flips individual spins, keeps the flips that reduce the total energy of the system, then lowers the system temperature and repeats the process. The Michigan group used the random switching probability of the CBRAM devices to reduce the risk of settling into a local minimum rather than the true minimum-energy state.
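A purely software version of simulated annealing on a tiny spin glass shows the structure of the algorithm; the hardware version replaces the pseudo-random acceptance step with the devices' physical randomness. The grid size, couplings, and cooling schedule below are arbitrary choices for illustration:

```python
import math
import random

random.seed(1)

# Toy spin glass: random +/-1 couplings J between neighbors on a 4x4 grid
N = 4
J = {}
for i in range(N):
    for j in range(N):
        if i + 1 < N:
            J[((i, j), (i + 1, j))] = random.choice([-1, 1])
        if j + 1 < N:
            J[((i, j), (i, j + 1))] = random.choice([-1, 1])

spins = {(i, j): random.choice([-1, 1]) for i in range(N) for j in range(N)}

def energy(s):
    """Total interaction energy of the spin configuration."""
    return -sum(c * s[a] * s[b] for (a, b), c in J.items())

# Simulated annealing: flip random spins, keep energy-lowering flips, accept
# some energy-raising flips while the temperature is high, then cool down
sites = list(spins)
T = 2.0
initial_energy = energy(spins)
best_energy = initial_energy
for _ in range(2000):
    site = random.choice(sites)
    before = energy(spins)
    spins[site] *= -1
    delta = energy(spins) - before
    if delta > 0 and random.random() > math.exp(-delta / T):
        spins[site] *= -1          # reject the uphill flip
    best_energy = min(best_energy, energy(spins))
    T *= 0.999                     # lower the temperature
```

The occasional acceptance of energy-raising flips is what lets the search escape local minima; that is the role the stochastic CBRAM switching plays in the Michigan hardware.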

In-memory computing looks to the future

Historically, electronic device research has come first, with electrical engineers and software developers then learning how to take advantage of the new capabilities.

In the past few years, emerging memory devices have moved from laboratory curiosities, to disappointing flash replacements, to enablers of new machine learning methods.

The next few years will show whether the semiconductor industry can use these devices to help manage the big data explosion that it is helping to create.