Should we embrace the hottest deep learning algorithms? Should we always adopt online learning? Should we trust the machine learning algorithms or not?
After attending the Data Science Summit in San Francisco, I left with more questions than conclusive answers. The summit felt more like a lively debate, and it ended on an inspiring, if open-ended, conclusion: find the right trade-off for your data science problem, and beyond that... it depends.
Deep learning or simpler machine learning models?
Deep learning was one of the hottest topics at the summit. Advanced deep learning algorithms allow us to go beyond simple parametric models to capture complex nonlinear dynamics in data. They have no doubt become very successful, powerful tools in solving many industry problems.
While many in the community praise the benefits of deep learning nowadays, Dr. Xavier Amatriain offered a more critical view. In his talk “Staying Shallow and Lean In a Deep Learning World,” he argued that obsessing over deep learning can be counterproductive. More generally, we should not treat complicated yet accurate models as the default solution for every problem. Instead, we should try simpler approaches first, such as feature engineering, XGBoost, non-parametric Bayesian methods and so forth, which can serve similar purposes in certain contexts. Even if deep learning or other complicated machine learning algorithms can deliver a certain level of accuracy, we must also take into account factors such as system complexity, maintenance and explainability. His perspective resonated with my experience solving data science-related problems at Clari: the right choice depends not only on the size, quality and nature of the data and the resources available to us, but also on the balance between the accuracy we need and the effort we spend solving the problem.
Static batch learning or online learning?
Online learning was another key word at the summit. Static batch learning techniques build a predictor by collecting the data first and then learning on the entire training dataset. Online learning, on the other hand, suits data that arrives sequentially: the predictor is updated continuously as each new data point comes in. Considering the nature of the data in many real-world problems, online learning seems to be the wiser choice, but is that always the case?
Turi Chief Architect Dr. Yucheng Low pointed out that static models are well known to be bad at capturing short-term trends and make it difficult to incorporate time-series context. Online learning algorithms, although preferable for streaming data in particular, are hard to generalize to all models and hard to diagnose when things break. Consider the cases where the data changes and the features in use no longer produce useful output, or where there is a long latency before changes in the system are reflected by the online learning algorithm. Again, the problem becomes finding the right trade-offs for the problems you solve.
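To make the contrast concrete, here is a minimal sketch of my own, not something shown at the summit. It assumes a toy one-feature linear relationship whose slope flips from 2.0 to -1.0 halfway through a simulated stream. The helper names (`fit_batch`, `OnlineLinearModel`, `make_stream`), the learning rate and the noise level are all hypothetical choices for illustration. The batch model is fit once on the early data, while the online model takes one stochastic gradient step per incoming point.

```python
import random


def fit_batch(xs, ys):
    """Closed-form least squares for y = w * x + b over the full dataset."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


class OnlineLinearModel:
    """Linear model updated one observation at a time (squared-error SGD)."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        # One gradient step on the squared error of this single point.
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error


def make_stream(slope, n, rng):
    """Hypothetical data generator: y = slope * x plus a little noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys


rng = random.Random(0)
old_xs, old_ys = make_stream(slope=2.0, n=500, rng=rng)   # before the change
new_xs, new_ys = make_stream(slope=-1.0, n=500, rng=rng)  # after the change

# Static batch model: trained once on the data collected before the change.
batch_w, batch_b = fit_batch(old_xs, old_ys)

# Online model: sees every point in arrival order and keeps updating.
online = OnlineLinearModel()
for x, y in zip(old_xs + new_xs, old_ys + new_ys):
    online.update(x, y)

print("batch slope:", round(batch_w, 2))
print("online slope:", round(online.w, 2))
```

In this setup the batch slope should stay close to the old value of 2.0, while the online slope should drift toward the new value of -1.0. The flip side is the diagnosis problem raised above: the online model's parameters keep moving, so a bad stretch of data can quietly degrade it just as easily as a real trend can improve it.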
Accurate black box or weak white box?
Although not usually considered a property of a model, explainability came up many times in the summit keynotes and discussions. When choosing a machine learning model, we usually face two options: accurate but black box, or weak but white box. Many widely used machine learning algorithms, such as deep learning and random forests, produce accurate results but behave like a black box to the people using them. Practical applications of machine learning systems usually call for the ability to explain why certain predictors work, so you can better understand the system, improve it by incorporating domain knowledge, or even make prescriptive suggestions. Our Chief Data Scientist Dr. Lei Tang shared more insight from the summit on this topic.
The Data Science Summit, in my opinion, felt more like a debate on all these data science-related problems: everybody shared valuable knowledge and experience from solving their own problems, and in doing so shed light on how to approach new ones. We don’t have definitive answers to all our questions yet; we can only find our own solutions through practice. What we know for sure is that data science will lead us to a better world in the future. If you would like to learn more about Clari’s machine learning technology or share your experiences solving these problems, we’d love to hear from you.