CDS Professors He He and Kyunghyun Cho Lecture at Oxford Machine Learning Summer School

NYU Center for Data Science
Jul 28, 2023

He discusses the international educational summit and the emerging field of machine learning applications in finance

CDS Assistant Professor He He

This year’s Oxford Machine Learning Summer School (OxML 2023) featured lectures by Assistant Professor of Computer Science and Data Science He He on Machine Learning in Finance and Associate Professor of Computer Science and Data Science Kyunghyun Cho on Machine Learning in Health.

The international machine learning educational summit, held from July 8th through the 16th, was organized by AI for Global Goals in collaboration with the University of Oxford Deep Medicine and the Canadian Institute for Advanced Research (CIFAR). The summer courses are designed for global participants to gain cutting-edge knowledge on advanced topics in deep learning. They cover developing areas, particularly those with applications to the United Nations’ sustainable development goals (SDGs).

At CDS, He’s research focuses on increasing the trustworthiness of human-AI collaboration. Her course at OxML 2023 covered how to maintain performance, accuracy, and fairness of predictions when using machine learning in finance. To learn more about OxML 2023 and machine learning applications in finance, CDS spoke with He. Read our interview with the CDS professor below!

Could you talk a bit about the course you taught for OxML 2023?

The summer session was on Machine Learning in Finance, so there were several lectures on finance and several on natural language processing (NLP), which is mainly what I talked about. My lecture was on the robustness of NLP systems, while others covered reasoning and retrieval-based language models. Overall, there was a focus on state-of-the-art large language models and the new questions that have come up in the field since the introduction of ChatGPT.

While I don’t have a background in finance, I’ve found it interesting how many areas there are where NLP is applicable. In finance, to make any decision you need to know what’s going on in the world, which is reflected in text such as company reports and newspapers. There is a lot of interest in learning how to apply machine learning to this field.

What do “robustness” and “truthfulness” for machine learning mean in the context of finance?

Robustness in statistics traditionally means you want to do well in the worst-case scenario. In machine learning, robustness generally means you don’t want the model to make unexpected errors when the data distribution has changed from the training distribution. One example in finance: the world is changing all the time, so the data you get this week could be very different from the data you get next week. If your model is trained on a static snapshot of the world, how do we make sure it keeps working over time? During the COVID-19 pandemic, many financial and trading companies saw losses because a novel event changed the data distribution, and their models weren’t robust enough to cope. There’s an interest in developing models that adapt to distribution shifts continuously.
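The failure mode He describes can be sketched in a few lines. Below is a minimal synthetic illustration (not from the lecture; the data and the "regime change" setup are invented): a classifier trained on a static snapshot of a relationship fails once that relationship reverses, and retraining on a recent window of data recovers performance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_batch(n, flip):
    # Hypothetical market signal; `flip` models a regime change
    # (e.g., a novel event like the pandemic reversing a pattern).
    x = rng.normal(0, 1, (n, 1))
    y = (x[:, 0] > 0).astype(int)
    if flip:
        y = 1 - y  # the learned relationship no longer holds
    return x, y

# Train on a static snapshot of the world.
X0, y0 = make_batch(500, flip=False)
static = LogisticRegression().fit(X0, y0)

# The distribution shifts: same inputs, reversed relationship.
X1, y1 = make_batch(500, flip=True)
print(static.score(X1, y1))   # stale model: accuracy collapses

# Continuously adapting: retrain on the most recent window.
adapted = LogisticRegression().fit(X1, y1)
print(adapted.score(X1, y1))  # accuracy recovers
```

A full reversal is an extreme shift chosen for clarity; real financial drift is usually gradual, but the remedy sketched here, refitting on recent data, is the same idea as the continuous adaptation He mentions.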

Truthfulness is a broad term, but when people think about truthfulness, they usually mean whether large language models generate factual statements. The other day I was asking a model if it knew a certain technique in a specific paper, and it gave me a bunch of explanations that looked plausible. When I checked, all of those explanations were hallucinated, meaning they weren’t real answers. That’s a huge problem for large language models nowadays.

Were there any areas of questioning that attendees were particularly engaged with during the course?

One question students were really interested in was how we discover weak spots in machine learning models. As we discussed with robustness, some of these models make unexpected errors, so how do we go about patching those up?

To give an example, suppose I have an image recognition model that, instead of recognizing what object is in the image, relies on the background. If wolves mostly appear in snow, and dogs mostly appear in grass, the model will look at the white background and predict a wolf. Given a dog in the snow, the model will fail, because the data distribution of the background for that animal has changed in an unexpected way. Once you identify that the background is the problem, there are a bunch of things you can do.
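This background shortcut is a spurious correlation, and it can be simulated with toy features. In the sketch below (not from the lecture; all features, names, and numbers are invented), a logistic regression sees a weak "animal shape" signal and a nearly label-perfect "snowy background" signal. The model leans on the background, so a dog photographed in snow gets misclassified as a wolf.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)  # label: 1 = wolf, 0 = dog

# "Animal shape" feature: genuinely informative but noisy.
shape = y + rng.normal(0, 1.0, n)
# "Snowy background" feature: matches the label 95% of the time
# in training -- a spurious shortcut the model can exploit.
background = np.where(rng.random(n) < 0.95, y, 1 - y) + rng.normal(0, 0.1, n)

X = np.column_stack([shape, background])
clf = LogisticRegression().fit(X, y)

# The background coefficient dominates the shape coefficient.
print(clf.coef_)

# A dog (shape ~ 0) photographed in snow (background ~ 1):
dog_in_snow = np.array([[0.0, 1.0]])
print(clf.predict(dog_in_snow))  # misclassified as wolf (1)
```

Inspecting which feature carries the weight is one simple way to surface the problem; once you know the background is doing the work, you can rebalance the data or penalize that feature, the kind of "patching" the question is about.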

In practice, the big challenge is finding these worst-case failures for a model. I think this is still an open problem, and it typically relies on domain knowledge, because you need to understand the problem at hand to figure out the potential weak spots. I think students are curious about how we could do this in a more general or principled way.

Another thing many students had questions about is the governance of large language models. We now know these models can do great things, but there are potential risks in what they might generate. I think in industry most people are thinking about how they can use these models to replace their current products. They’re not thinking so much about how these models should be regulated or audited, and what the negative outcomes could be.

Speaking of governance, a component of OxML 2023 covered emerging topics in machine learning and their applications to sustainable development goals (SDGs). In what ways does machine learning in finance apply to SDGs?

In terms of development goals, I think one important point right now is that these models are mostly trained on English data. If we want to deploy these models globally, they must be adapted to different languages and cultures. For many subjective questions, people from different countries may agree with different answers. How we make sure these language models are aligned with many different groups of people, as opposed to just a few, is an important question going forward.

Any additional thoughts about OxML 2023 overall?

I thought it was very well organized! The students were all engaged in the courses and came from diverse backgrounds. Of course, the attendees were primarily finance and business students, but there were also students with backgrounds in healthcare, NLP-related fields, and policy. I learned a lot from the students because they brought in perspectives from a range of fields. Some of the questions they asked made me think about new angles on the problem, so overall it was a great experience.

By Meryl Phair


NYU Center for Data Science

Official account of the Center for Data Science at NYU, home of the Undergraduate, Master’s, and Ph.D. programs in Data Science.