The Open Source Roots of Machine Learning


The concept of machine learning, a subset of artificial intelligence, has been around for some time. Ali Ghodsi, an adjunct professor at UC Berkeley, describes it as “an advanced statistical technique to make predictions on a massive amount of data.” Ghodsi has been influential in Big Data, distributed systems, and machine learning, contributing to projects including Apache Spark, Apache Hadoop, and Apache Mesos. Here, he shares insight into these projects, various use cases, and the future of machine learning.

Building a business around AI

There are commonalities among the three projects that Ghodsi’s research has influenced. All have successful companies and business models built around them: Databricks for Apache Spark, Mesosphere for Apache Mesos, and Hortonworks in the case of Apache Hadoop.

“Around 2008-2009, people were building interesting and exciting use cases around artificial intelligence and machine learning. If you could use modern machines — hardware like GPUs — to scale your data and combine Big Data with AI, you could get fantastic results,” said Ghodsi. “But very few companies were successful at that. Everyone else had a hard time succeeding in such efforts. That’s why all these open source projects started.”

Open source theme

Another common theme across these projects and companies is open source. “We wanted to democratize that kind of technology. So, we created them as open source projects, with communities around them to enable contributions,” Ghodsi said.

These companies continue to be leading contributors to their respective projects. For example, Databricks is one of the major contributors to the Apache Spark project, but it is not the sole contributor. A true open source project won’t succeed without a diverse community around it. “If you want millions of developers around the world to use a new API that you’ve built, it better be open source. If it’s not open source, it will be challenging to get people to adopt it,” said Ghodsi.

A bit about machine learning

Machine learning replaces manual, repeatable processes. Such systems existed previously, but they used different models to achieve automation. “A lot of previous systems were rule-based: if this happens, then do that,” said Ghodsi. “But if you want to moderate billions of chat messages in real time, you can’t do that manually with people sitting around and monitoring everything.”

“You can’t do that with rule-based techniques either. The problem with rule-based techniques is that there is always a way to game them and go around them. The best way of doing it is to have a real-time machine learning engine that can be trained,” he said.

Ghodsi provided an example of a “very large company” that serves billions of people with its free chat application, most of whose users are teenagers. The company uses artificial intelligence, machine learning, and natural language processing to automatically detect foul language or alarming activity.

The chat messages pass through machine learning tools and are labeled accordingly. Over time, the machine learning algorithm starts seeing patterns in age, timing, message length, and so on. It finds those patterns itself instead of relying on a person to set rules, and because it keeps evolving, it is far harder to game. The biggest flaw of the traditional rule-based system is that once someone figures out a way around the rules, it takes time to create and deploy a new rule set. It’s a cat-and-mouse game. Machine learning overcomes that problem and becomes a very powerful tool in such cases.
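As a rough illustration of the difference, here is a minimal sketch of a learned moderation filter built with PySpark’s ML pipeline. The schema, features (sender age, hour sent, message length), and labeled examples are hypothetical, not the company’s actual system:

```python
# Hypothetical sketch: training a chat-moderation classifier with PySpark ML.
# The features mirror the patterns mentioned above (age, timing, length);
# the data and schema are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("chat-moderation").getOrCreate()

# Labeled history: 1.0 = flagged by human moderators, 0.0 = benign.
messages = spark.createDataFrame(
    [("u free tonight?", 17, 22, 15, 0.0),
     ("send me your address", 14, 23, 20, 1.0)],
    ["text", "sender_age", "hour_sent", "msg_length", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    HashingTF(inputCol="tokens", outputCol="text_features", numFeatures=1 << 12),
    VectorAssembler(
        inputCols=["text_features", "sender_age", "hour_sent", "msg_length"],
        outputCol="features",
    ),
    LogisticRegression(featuresCol="features", labelCol="label"),
])

model = pipeline.fit(messages)      # learns patterns instead of fixed rules
scored = model.transform(messages)  # in production, applied to a live stream
scored.select("text", "probability", "prediction").show(truncate=False)
```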

Another use case of machine learning is credit card fraud. “Every time you swipe a credit card, machine learning can help detect if it was a fraudulent swipe. It can detect anomalies in real time. Another use case is the security of corporate networks. You have billions of packets coming into your corporate network; how would you know one of them is an attack? Machine learning enables you to do that effectively in real time,” said Ghodsi.
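To make the real-time anomaly-detection idea concrete, here is a toy sketch that flags card swipes far outside a card’s running statistics. It uses a simple online z-score; production fraud systems rely on far richer features and models:

```python
# Toy sketch of streaming anomaly detection on card swipes: flag any
# transaction far outside the card's running mean. Real systems combine
# many signals (merchant, geography, timing); this shows only the idea.
import math

class SwipeMonitor:
    def __init__(self, threshold=3.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations (Welford)
        self.threshold = threshold

    def observe(self, amount):
        """Return True if this swipe looks anomalous, then update stats."""
        anomalous = False
        if self.n >= 10:       # need some history before judging
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(amount - self.mean) / std > self.threshold:
                anomalous = True
        # Welford's online update: O(1) per event, no stored history.
        self.n += 1
        delta = amount - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (amount - self.mean)
        return anomalous

monitor = SwipeMonitor()
for amount in [42.0, 38.5, 41.0, 40.2, 39.9, 43.1, 40.8, 41.5, 39.0, 42.2, 950.0]:
    if monitor.observe(amount):
        print(f"possible fraud: ${amount:.2f}")
```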

Machine learning helping machine learning

Machine learning is also being used to make the IT infrastructure stack itself intelligent, secure, and efficient. Any IT stack produces a massive amount of logs that get stored somewhere. Customers pay for that storage, but no one actually looks at the logs unless something breaks. Machine learning helps mine these logs and improve the overall stack.

Databricks is using machine learning to improve Apache Spark itself, looking at error messages hitting customers, increases in latency, and so on. “If you look at the RPC messages that are sent to a Big Data cluster running Apache Spark, how do you detect if it’s getting slower? Even if there is a one millisecond delay, it will have a big impact on the performance of Spark itself. You can mine those logs that you have collected from the whole Spark computation and then use machine learning to actually optimize the stack itself,” said Ghodsi.
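A hedged sketch of that log-mining idea in PySpark: compute per-minute p99 RPC latency from collected cluster logs and flag windows that drift above each RPC’s baseline. The log location and schema (timestamp, rpc, latency_ms) are assumptions for illustration, not Databricks’ actual pipeline:

```python
# Hypothetical sketch: mining RPC logs for latency regressions with Spark.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rpc-latency-mining").getOrCreate()

# Assumed log layout: one JSON record per RPC with timestamp, rpc, latency_ms.
logs = (
    spark.read.json("s3://example-bucket/spark-rpc-logs/")
         .withColumn("timestamp", F.to_timestamp("timestamp"))
)

# Per-minute p99 latency for each RPC type.
p99 = (
    logs.groupBy(F.window("timestamp", "1 minute"), "rpc")
        .agg(F.percentile_approx("latency_ms", 0.99).alias("p99_ms"))
)

# Flag any RPC whose p99 latency drifts well above its overall baseline --
# even millisecond-level regressions matter at Spark's message volumes.
baseline = p99.groupBy("rpc").agg(F.avg("p99_ms").alias("baseline_ms"))
alerts = (
    p99.join(baseline, "rpc")
       .where(F.col("p99_ms") > 1.5 * F.col("baseline_ms"))
)
alerts.orderBy("window").show(truncate=False)
```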

Mixing machine learning with cloud

Cloud, whether public or private, has become an integral part of modern IT infrastructure, and Databricks users want to take advantage of it. According to Ghodsi, many Databricks users rely on Azure Storage, Cosmos DB, and SQL Data Warehouse, and they wanted better integration with Databricks. “We wanted to remove friction and enable these companies to democratize artificial intelligence,” said Ghodsi.

Databricks has partnered with Microsoft to bring Azure to its customers. As a result of the combined efforts of the two companies, users get a tightly integrated solution. Ghodsi emphasized that integration is critical because it’s very challenging for enterprise users to build these AI applications. Machine learning doesn’t just happen; you don’t simply write the code and call it done. You need to go back to the data, combine it with other data sources coming from various places, and iterate back and forth.

Ghodsi provided an example from the healthcare industry: a company using natural language processing on medical records to analyze them and build phenotype databases. Say a patient has type 2 diabetes, and the company also has that patient’s sequenced genome. The company uses machine learning to combine these two data sources and find out which parts of the genome are associated with type 2 diabetes, which can guide the development of better drugs. Because it is dealing with a large amount of data that needs to run on a secure, compliant, and scalable platform, it needs to leverage the cloud with Apache Spark capabilities.
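The core of that workflow is a join between the two data sources. Here is a hedged sketch in PySpark; the table names and schemas are invented, and the “enrichment” score is a crude illustration, not a real genome-wide association analysis:

```python
# Hypothetical sketch: linking a phenotype database to sequenced variants
# to see which variants are enriched among type 2 diabetes patients.
# All table names and schemas here are invented for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("phenotype-genotype").getOrCreate()

phenotypes = spark.table("phenotype_db")   # patient_id, has_t2_diabetes (bool)
variants = spark.table("genome_variants")  # patient_id, variant_id (carriers)

# For each variant, measure how often its carriers have the phenotype --
# a crude enrichment score, not a rigorous statistical test.
joined = variants.join(phenotypes, "patient_id")
enrichment = (
    joined.groupBy("variant_id")
          .agg(
              F.count("*").alias("carriers"),
              F.avg(F.col("has_t2_diabetes").cast("double")).alias("case_rate"),
          )
          .where(F.col("carriers") >= 100)  # ignore rare variants in this toy
          .orderBy(F.col("case_rate").desc())
)
enrichment.show(20)
```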

It’s future-proof

Machine learning is going to play a massive role in the coming years. “Machine learning is going to create many new jobs,” said Ghodsi. “Putting my UC Berkeley hat on, I see tremendous interest from the students who want to study machine learning. We are going to see a generation of data scientists that doesn’t yet exist. They will think of things that you and I are not smart enough to think of. The next generation will come up with even better technologies and ideas.”