Problematic, racist, and pornographic web content is seemingly being used to train Google's large language models, despite efforts to filter out that kind of toxic and harmful text. An investigation by The Washington Post and the Allen Institute for AI analyzed Google's immense public C4 dataset, released for academic research, to get a better understanding [...]