
The Key to Flexibility and Computational Scale Is Distributed Machine Learning

Distributed machine learning emerged together with the concept of “big data.” Before big data, there was already a great deal of research on making machine learning algorithms run faster by using multiple processors. This work is often referred to as “parallel computing” or “parallel machine learning,” and its core goal is to break a computing task into many small tasks that can be distributed across multiple processors and computed in parallel.
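To make this concrete, here is a minimal single-machine sketch of that kind of task decomposition (an illustration, not anything from the article): the gradient of a squared-error loss is a sum over training examples, so each processor can compute the gradient over its own chunk of the data, and the partial results simply add up. All function names and data below are invented for the example.

```python
import numpy as np
from multiprocessing import Pool

def chunk_gradient(args):
    """Gradient of 0.5 * ||Xw - y||^2 over one chunk of the data."""
    X, y, w = args
    return X.T @ (X @ w - y)

def parallel_gradient(X, y, w, n_workers=4):
    # Split the rows (training examples) into one chunk per worker.
    chunks = list(zip(np.array_split(X, n_workers),
                      np.array_split(y, n_workers),
                      [w] * n_workers))
    with Pool(n_workers) as pool:
        partials = pool.map(chunk_gradient, chunks)
    return sum(partials)  # partial gradients sum to the full gradient

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(10_000, 20)), rng.normal(size=10_000)
    w = np.zeros(20)
    g = parallel_gradient(X, y, w)
    assert np.allclose(g, X.T @ (X @ w - y))  # matches the serial result
```

The same decomposition is what distributed systems apply across machines instead of processes on one machine.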

Distributed computing, or distributed machine learning, not only distributes computing tasks across multiple processors; more importantly, it also distributes the data, including the training data and intermediate results. In the era of big data, a single machine’s hard disk often cannot hold all the data, and even when it can, access is limited by the bandwidth of the machine’s I/O channels and is therefore very slow. For greater storage capacity, throughput, and fault tolerance, we want to distribute the data across multiple machines.
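A toy illustration of that idea (the shard count and hashing scheme below are assumptions made up for this sketch, not any particular system’s design): records are assigned to shards by hashing a key, so storage and I/O load are spread across many machines, and reads and writes can proceed in parallel instead of queuing on one disk.

```python
import hashlib

N_SHARDS = 8  # imagine one shard per machine

def shard_of(record_key: str) -> int:
    """Stable assignment of a record to a shard (machine)."""
    digest = hashlib.md5(record_key.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

# Distribute 1,000 toy records; each shard ends up with roughly 125.
shards = {i: [] for i in range(N_SHARDS)}
for key in (f"user-{i}" for i in range(1000)):
    shards[shard_of(key)].append(key)

print({i: len(records) for i, records in shards.items()})
```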

So what kind of data is so large that a single machine’s hard disk, or even hundreds of machines, cannot hold it? After all, many servers now have several terabytes of disk space. In fact, such big data is common. For example, search engines have to crawl an enormous number of web pages and analyze and index their content. How many web pages are there? The number is hard to estimate, because it changes over time.

Before the advent of Web 2.0, the number of web pages worldwide grew at a relatively steady pace, because pages were edited by professionals. Now that various Web 2.0 tools let users build their own pages, from blogs to microblogs such as Weibo, the number of web pages is growing exponentially.

Another typical kind of big data is the user behavior data on e-commerce websites.

Example:

On Amazon or Taobao, users see many recommended products every day and click on some of them. These clicks on recommended products are recorded by Amazon’s and Taobao’s servers as the input to a distributed machine learning system. The output is a mathematical model that predicts which items a user likes to see, so that the next time recommendations are displayed, the items the user likes come first.
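Here is a hedged sketch of that feedback loop, with an invented log format and feature scheme (this is not Amazon’s or Taobao’s actual pipeline): click records become training examples, a simple logistic-regression model learns to predict clicks, and the learned weights rank future recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy click log: user feature vector, item feature vector, clicked or not.
n_logs, d = 5000, 8
users, items = rng.normal(size=(n_logs, d)), rng.normal(size=(n_logs, d))
X = np.hstack([users, items])
true_w = rng.normal(size=2 * d)
clicks = (rng.random(n_logs) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

# Logistic regression by gradient descent: the learned weights are the
# "mathematical model" that scores future recommendations.
w = np.zeros(2 * d)
for _ in range(200):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - clicks) / n_logs

def click_probability(user_vec, item_vec):
    return 1 / (1 + np.exp(-np.concatenate([user_vec, item_vec]) @ w))

# Rank candidate items for one user by predicted click probability.
user = rng.normal(size=d)
candidates = rng.normal(size=(10, d))
ranking = sorted(range(10), key=lambda i: -click_probability(user, candidates[i]))
print("show items in order:", ranking)
```

In a real distributed system the log would be sharded across machines and the gradient computed in parallel, exactly as in the earlier sketches.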

Today, everyone can use Google’s speech recognition system over the Internet. We find that it recognizes speech almost perfectly regardless of the user’s accent, so it is almost no longer necessary to make the system “adapt to the owner’s accent.” Google’s system also supports many languages. The secret behind this is “big data.”

Before Google released its speech recognition engine, it had a voice search service, and before voice search there was a telephone inquiry service. That phone service collected a large amount of user voice input. This data was labeled by hand and became the first batch of training data for the language models and acoustic models. Voice search then collected the voices of many more Internet users around the world, and with the introduction of a semi-automatic labeling system, the training data grew enormously. The more training data there is, the more accents and languages can be covered, and the higher the recognition accuracy of the model learned by machine learning.

So if we can design a distributed machine learning system that generalizes rules from big data, we are in effect summarizing the knowledge of all humanity. This sounds amazing, but in the example above, Google has actually done it. In the last section of this series, we will introduce a semantic learning system we developed, which distills millions of Chinese “semantics” from hundreds of billions of text documents. Then, whenever a user enters a piece of text, the system can use the trained model to understand the “semantics” expressed in that text within a millisecond. This understanding resolves ambiguities in the text, allowing applications such as search engines, advertising systems, and recommendation systems to better understand user needs.
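That system is not described in detail here, so the following is only a toy stand-in: a tiny hand-made topic model plays the role of the trained “semantics,” and simple Bayesian inference over it shows how context disambiguates a word such as “apple.” The topics, vocabulary, and probabilities are all invented for illustration.

```python
import numpy as np

topics = ["fruit", "technology"]
# P(word | topic) for a tiny vocabulary.
word_given_topic = {
    "apple":  np.array([0.40, 0.40]),
    "pie":    np.array([0.50, 0.01]),
    "iphone": np.array([0.01, 0.50]),
}

def infer_semantics(text, prior=np.array([0.5, 0.5])):
    """Posterior over topics for a short text, treating words as independent."""
    posterior = prior.copy()
    for word in text.lower().split():
        if word in word_given_topic:
            posterior *= word_given_topic[word]
    return posterior / posterior.sum()

# The same word "apple" is disambiguated by its context:
for query in ("apple pie", "apple iphone"):
    p = infer_semantics(query)
    print(query, "->", dict(zip(topics, p.round(3))))
```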

In short, the Internet gives humanity its first chance to collect the behavioral data of all humans. This opens a new opportunity for a direction machine learning has pursued for decades: distributed machine learning, which distills this human knowledge from Internet data and makes machines “smarter” than any single individual.
