AI benchmarking organization criticized for waiting to disclose funding from OpenAI


An organization that develops math benchmarks for AI did not disclose until recently that it had received funding from OpenAI, drawing allegations of impropriety from some in the AI community.

Epoch AI, a non-profit organization funded primarily by Open Philanthropy, a research and grantmaking foundation, revealed on December 20 that OpenAI had supported the creation of FrontierMath. FrontierMath, a test with expert-level problems designed to measure an AI’s mathematical skills, was one of the benchmarks OpenAI used to demo o3, its upcoming flagship AI.

In a post on the LessWrong forum, a contractor for Epoch AI who goes by the username “Meemi” says that many contributors to the FrontierMath benchmark were not informed of OpenAI’s involvement until it was made public.

“Communication on this has been non-transparent,” Meemi wrote. “In my view, Epoch AI should have disclosed OpenAI funding, and contractors should have transparent information about the potential of their work being used for capabilities, when choosing whether to work on a benchmark.”

On social media, some users raised concerns that the secrecy could erode FrontierMath’s reputation as an objective benchmark. In addition to supporting FrontierMath, OpenAI had access to many of the problems and solutions in the benchmark – a fact that Epoch AI did not disclose before December 20, when o3 was announced.

In a response to Meemi’s post, Tamay Besiroglu, associate director of Epoch AI and one of the organization’s co-founders, stated that FrontierMath’s integrity had not been compromised, but admitted that Epoch AI “made a mistake” in not being more transparent.

“We were limited in disclosing the partnership until the time of o3’s launch, and in retrospect we should have negotiated harder for the ability to be transparent to benchmark contributors as soon as possible,” Besiroglu wrote. “Our mathematicians deserve to know who could have access to their work. Even if we were contractually limited in what we could say, we should have made transparency with our contributors a non-negotiable part of our agreement with OpenAI.”

Besiroglu added that while OpenAI has access to FrontierMath, it has a “verbal agreement” with Epoch AI not to use FrontierMath’s problem set to train its AI. (Training an AI on FrontierMath would be akin to teaching to the test.) Epoch AI also has a “separate holdout set” that serves as an additional safeguard for independent verification of FrontierMath’s benchmark results, Besiroglu said.

“OpenAI has … fully supported our decision to maintain a separate, unseen holdout set,” Besiroglu wrote.

However, muddying the waters, Epoch AI lead mathematician Elliot Glazer noted in a post on Reddit that Epoch AI has not yet been able to independently verify OpenAI’s FrontierMath o3 results.

“My personal opinion is that [OpenAI’s] score is legitimate (i.e., they didn’t train on the dataset), and that they have no incentive to lie about internal benchmarking performance,” Glazer said. “However, we cannot vouch for them until our independent evaluation is complete.”

The saga is yet another example of the challenge of developing empirical benchmarks to evaluate AI, and of securing the resources needed for benchmark development without creating the perception of conflicts of interest.


