(a lot is only accessible from Norwegian IP addresses, so it's one of the main reasons I maintain a VPN as I'm Norwegian but live in the UK; a second set is only available from the IP addresses of libraries or research institutions - still huge amounts that are generally available, though)
When searching through the closed newspapers, you have to apply for access manually, which gives you 8 hours of access. Great. Only that the access is seemingly manually granted - so if you apply 16:05 on a Friday, chances are you won't get any access until 9-10 the next Monday.
With that said, I do understand why it is like that. If people could apply via API, and get instant access, they would probably just stop buying newspaper subscriptions.
There is an escalating series of restrictions, basically:
* Available for everyone.
* Available from a Norwegian IP -> just requires a VPN.
* Available from Norwegian libraries
* Available under "special conditions". This would mean from a participating research institution or university, or similar.
Pretty much everything that is out of copyright falls in the first category. The second and third categories have a bunch of copyrighted material where the copyright holders have granted limited usage rights. A bunch of newspaper archive material that is still under copyright (but sadly not the biggest ones) is available from Norwegian IPs, for example.
I thought all big players already train on basically everything remotely available to them no matter the language or quality, so his take sounds like an opinion formed in the early days of generally available LLMs.
So, given the anglo-centrism of current models, my confidence in American providers giving any shits about non-american users/use-cases is pretty low. And lower the smaller the language community is.
I was trying to work out how and when to use swear words, and the relative power index of them. It translated English swear words into the target language, then lectured me on not using them.
It took a bunch of prodding for it to actually think as the target language to then get the (mostly) correct response.
Not kidding at all. I had a similar issue with a project where I needed to classify images into specific demographics, and Gemini, while capable, was entirely not going to do the task… until in my JSON response I left room for it to tell me why this was not a good idea and why it was culturally insensitive. Then boom… full JSON array: hair color, eye color, skin color, fitness level, likely ethnicity, likely country of origin, and about 10 other values.
You’re probably wondering what on earth I was working on. I was matching AI-generated headshots to AI voices so that in an app the voice picker had human (AI) faces.
If you’re doing this on a daily basis, then you should have an AGENTS.md that accumulates directional instructions like this.
This is how you use the tool correctly.
There’s this weird pattern I’ve noticed where people expect LLMs to require zero effort or proficiency on their part, and when the LLM isn’t perfect without it, the verdict is: of course it wasn’t, LLMs suck.
Very very emphatic agree from my end, thanks.
Then add top-level instructions saying what country you're from, what country you live in now, and which language you speak. This isn't that hard.
Most ordinary people will just use their native language and they have no way of knowing that the model always reasons in English and therefore is strongly biased toward using English search terms. So they don't know they have to remind the model to search in their local language.
I have the opposite problem, where I'll ask in English, about something in a foreign country, the results it finds will all be in that foreign language, and the LLM will switch languages and respond in that language (which I don't speak).
So then I have to ask it "can you repeat that in English please."
I keep waiting for the new GPT-Definitely-AGI-For-Real-This-Time to fix it but it's still there.
not necessarily. i often prompt Claude in German and then see the reasoning happening in English. of course it will eventually reply in German, but that does not mean that the tooling in the background was using German.
I wonder how much of this is also just the search engine's region setting.
It's a big problem I regularly have with Google. I almost always want English language, US-centric results, so I have my region set to the US. But occasionally I want results relevant to my actual country, and even searching in my native language usually yields much worse results than just opening an incognito tab and letting it default to my real location.
Which is bizarre to me: Norway doesn't have a booming tech sector with all that wealth fund acting as the biggest VC.
They instead use their wealth fund to invest in US's tech sector. Baffling.
It would create jobs, sovereignty, intellectual property and soft power?
Instead it goes to strengthening the tech monopoly of a country that threatens to invade your neighbour.
Norway has a manpower bottleneck. The UK had spent its oil windfall domestically and it barely registered. But for a nation of then some 4 million the economy melts down with so much monetary mass.
Even Norway themselves admit they're the underperformers of the Nordics. https://skywlkr.no/wp-content/uploads/2019/10/TechScaleupNor...
So blaming population is a cheap excuse that doesn't hold water. Especially since you can always import the skilled people you lack, when you have virtually unlimited money and some of the highest standards of living in the world.
Translation is never a bijective process. It's never quite the same experience in translation as it is in the original, due to the cultural differences between reader and writer. Larger in this case because 1930s Norway is very different even from 2020s Norway.
Ultimately this was not a success due to marketing difficulties; it is very difficult to get a book noticed.
( https://www.amazon.co.uk/Iron-Chariot-Nordic-Crime-Library/d... )
I just think building an LLM from scratch is even harder, with more potential problems that are harder to solve, more time-consuming and even more resource-intensive.
Why would it be easier in the future? The advances we see with LLMs today require a huge amount of data, and it's getting hard to get that amount of data even when drawing on any language. I'm having a hard time seeing how it'd get easier for Norwegians to build their own LLM, unless they seriously start to ramp up how much Norwegian content they're putting out.
> If we need to translate everything to English we might as well just drop using Norwegian altogether. Practically everyone speaks English fluently already...
Yeah I mean with that black and white perspective you can pretty much do anything and it won't matter for anything :) I think for the rest of us, what we speak daily and what we rely on professionally, can differ, and that's OK. But maybe this is just my broken Swedish mind being so used to using English professionally but then conversing in Spanish outside of work daily, YMMV.
You make it sound like an easier task than training an LLM. I'd argue it's not obvious, and would assume the contrary.
With absolutely no insight into why, which one has better odds to happen first is obvious to me.
Having insights into both translations, transcriptions and attempting to build LLMs myself, I'm fairly sure which effort would be successful first, regardless of how many attempt it first.
Only if you believe other people will value that enough to expend the effort necessary to use it. If you believe other people will see it as low value and ignore it then you'd be better off doing the training yourself in order to guarantee it happens.
There's also a secondary benefit that your team doing the work will learn some useful skills while they do it.
I mean it's their job to give people access to information, and they certainly do, but the mark of a professional, in their eyes, is guarding information. It's much more embarrassing for them professionally to give too much access than too little.
LLM training gives them a "respectable" way of bypassing that and give the world their information (which, in fairness, they probably all really want to do if they could).
Uuh.. No? Especially if the training data, as in this case, is of better quality.
Answer: idiocy of decision makers and the desire to get resources by those who created the proposal.
I assumed Scandinavia has better decision processes but apparently I was wrong.
Of course I then usually put the information I'm interested in somewhere AI could scrape it. But it would take a long, long time to get everything interesting out of there.
Husnes said: ”No private company has this.”
So yeah they seem to have proprietary data...
It is just copyrighted data, that is harder to get a hold of. All the copies are available to anyone to use if they just read it. Copyright makes other uses complicated. I wonder if the whole Creative commons debate was a mistake, you can never fix copyright in a digital world.
strong disagree on that one. As a German interacting with ChatGPT, even in German it gives me the feeling of talking to the Pluribus people, which reminds me of an anecdote of Walmart failing in Germany because people were freaked out by the constantly upbeat, smiling employees.
Understanding a culture is a very different task than translating the syntax of a text, and these systems might be capable of syntactic fluency but they do not really understand culture. You have to metaphorically abuse these models until they stop sounding like the crossover of a HR department person and a Mormon missionary
You're a machine, stop anthropomorphizing yourself and pretending to be my best friend, and just give me the damn answer and nothing else. :D
I do understand where proponents of language equivalency are coming from. LLMs seem to be extremely good at answering simple, one-shot type questions and mechanical 'low-level' translations for most languages. I feel like as soon as you introduce complex chains of thought or multi-step cross-linguistic tasks, minor imperfections stack and become magnified, just as with coding tasks or context rot.
English is ludicrously over abundant in training when compared to any language.
Training a sovereign LLM with this meager hardware as opposed to a LORA on some open source model seems like a huge mistake and a potential red flag.
There is no way these people have the resources to train a fully fledged LLM, so claiming that is their goal makes me think they don't intend for the LLM to be useful.
Which begs the question, whose money are they wasting - and why?
Even though it's nominally the national library behind this, they were probably chosen (as per the article) because they legally own and can use all NO material for this end. I'd guess researchers from related entities like unis will be involved in the process.
I don’t think they aim for anything worthwhile. The finetunes were incredibly broken. I’m guessing it’s more about having the method to do it. I’m not convinced it’s super useful but I’m not one to decide who gets to do what with the research funds.
One finetune I tried did make fun of humans expressing their feelings in the chat. Often.
One other finetune did hallucinate that it was a doctor and my baby had terrible diseases, every time I just wrote "hei" (with a generic neutral system prompt that likely triggered this behaviour though).
I think Olivia is big enough for what it’s used for. In my opinion it’s better to stay up to date and not waste too much money on hardware at the moment.
> they wasting - and why?
i18n language models are not an area frontier labs are focusing a ton of resources on? (certainly not in Norwegian)
The corpus of content in Norwegian may not require very large clusters, or even if it does, this is the best that the library could do; it would certainly be more than anyone else is investing in Norwegian models.
SOTA models do not have the access to the quality of content that the national library does? The article mentions licensing with newspapers specifically, and the library has access to its own content archive.
English and Norwegian are not closely related language families, so perhaps LoRA is not the best approach?
I am curious if there is published research on how well localization works with LoRA depending on how far off the target language grammar/vocabulary is from English.
Projects like this typically have more than one objective: they are not only about building a SOTA project, but also about building and training foundational local talent, similar to universities launching satellites.
Yes, they are. English is a West Germanic language. Norwegian is a North Germanic language. The French vocabulary in English obscures it a bit, but the two languages have similar grammar and the vocabulary has a huge number of close cognates.
E.g. day -> dag, ship -> skip, apple -> eple, cow -> ku (which makes more sense when you pronounce them correctly out loud), bairn (child; mostly Scotland and Northern England) -> barn, hop -> hopp, yule -> jul just to give a random selection of English Germanic words.
But more than that, the frontier models both a) know Norwegian quite well, and b) certainly know German and Dutch well, and there's a continuum of language transfer around the North Sea, especially when accounting for sounds rather than modern orthography (e.g. to take a couple of examples from above: ship -> schip -> Schiff -> skib -> skip; day -> dag -> Tag -> dag). The "jump" to Dutch already weeds out most of the French. A lot of modern Norwegian orthography comes from Danish, which again shares more than modern Norwegian does with German.
Knowing any of these helps a lot with learning Norwegian and vice versa. E.g. I'm Norwegian, I've never learnt Dutch, but I have learnt English and German, and I can read Dutch fairly well from that alone.
The grammar is perhaps more likely to help. Similar word order etc. Even weirdness like German - my only top grade on a German essay in school was one where I on purpose ignored what I thought I knew about German and tried to evoke "old fashioned" Norwegian. The result was guessing at a bunch of grammatical structures that I didn't know were valid German. Turned out I was right about most of it - century old Norwegian was far closer to century old Danish, which was a lot closer to valid German, and enough so to impress my teacher into overlooking a number of orthographic mistakes.
"What sayest thou?" -> "Was sagst du?"
In fact, for the above, you don't even have to know a single German word. You just have to know that for question words, "wh" -> "w", that the English "y" at the end of a syllable usually comes from an older Germanic "g" sound, and that "th" was replaced by "d" in German. That gets you 90% of the way from early modern English to modern German in the above example.
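A toy sketch of those three rules in Python, purely as an illustration. The `RULES` list and the `shift` helper are names I made up, and blindly replacing every "y" is cruder than the "end of a syllable" rule described above:

```python
# Toy illustration of the three sound correspondences described above.
# RULES and shift() are hypothetical names; this is not a real transliterator.
RULES = [
    ("wh", "w"),  # question words: "wh" -> "w"
    ("th", "d"),  # "th" -> "d"
    ("y", "g"),   # crude assumption: every "y" becomes "g", not just syllable-final ones
]


def shift(text: str) -> str:
    """Apply each replacement rule, in order, to the lower-cased text."""
    result = text.lower()
    for old, new in RULES:
        result = result.replace(old, new)
    return result


print(shift("What sayest thou?"))
```

That lands on "wat sagest dou?", which is not German, but it is close enough to "Was sagst du?" to show how far three mechanical substitutions get you.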
They have already done experiments with different sub 10b models, with both fine-tuning and fully from scratch. And last I checked, the fully from scratch ones captured the language in a better way.
Norway has a sovereign fund worth O[MS|Apple|etc] except it is largely in readies and not pixie dust.
Whilst the UK frittered away North Sea oil profits, Norge squirreled them away instead.
So, if the grand dream of LLMs and AI does actually come to some sort of fruition and not simply another case of the Emperor's New Clothes combined with some lovely tulips and a dotcom boom and bust, then Norge can simply stuff shit loads of cash into buying whatever they need. Cash is king after all.
The beast they have described here is just a library system. I think I'd like my country's (UK) library system to have resources like that.
I don't think you are asking the right question: When you say "meager", I see "rather impressive PoC from a well resourced organisation"
You say tomato ...
It is run to maximise growth for example, so even though Norway is way ahead with electric car usage and infrastructure (presumably because they have a climate likely to be most affected by global warming/heating) their fund still invests in fossil fuels as they are a profit/growth opportunity.
Anyway, i don't think it's as easy as "simply stuff shit loads of cash into buying whatever they need". I believe there would be a serious political discussion needed for that to happen.
What do you suggest, that they stop and wait until they have the right HW?
"Norway's sovereign wealth fund, officially known as the Government Pension Fund Global, is the world's largest sovereign wealth fund with assets exceeding $2 trillion. Established in 1990 and managed by Norges Bank Investment Management, it was created to channel surplus petroleum revenues into long-term global investments to benefit future generations."
Also the line between “finetuning a base model” and “man this is a real good initialization” gets pretty blurry at scale.
Altogether a pretty presumptuous take.
Depends on what they are doing and why. but at most big labs, only the final model training happens on the big clusters. a lot of experimentation happens on <500 gpus per dev.
So for fast iteration, this seems fine.
but that only gets you so far, you need a bigger multi-GPU setup to do the higher dimension stuff. You can use a DGX, but again that's limiting up to a certain point.
Norway is better run as a country than 99% of the countries on the planet, including the one that invented current LLM tech, so I'd give them the benefit of the doubt.
Qwen was made on a cluster about that size.
And this is before anybody ever thought about optimizing the training process. (Currently it's just pytorch analyst-as-coder slop, with extremely overprovisioned quantizations, etc.)
Seems like making the frontier models know Norwegian and their culture is a better (or additional!) way to reach the end they are going for here.
E.g. I had Claude describe the novel "De knyttede næver" from 1911 in Norwegian orthography ca. 1911, as it's a novel I've read, and it does a good job.
What it lacks is an understanding of Norwegian literature, culture and history. It had to look up "De knyttede næver", which was one of the best-selling Norwegian novels around the time it was published, before I'd get anything out of it (ChatGPT does better; in thinking mode in particular it gives a detailed summary).
While not exactly well known today, the author was a prominent newspaper journalist for decades, and the novel series is well enough known that e.g. there's a Norwegian singer that took his stage name after the protagonist, and it was covered in Norwegian papers and books for decades (partly because of controversy over the author's political views and how they coloured his novels), so it does feel like a reasonable test that reveals a quite significant knowledge gap.
I do agree with you that it'd be better if the data set from the national library was made more accessible, though it seems a major addition here is that they have a deal to train on copyrighted data locked away in their archives that they have limitations on the use of.
But even just making the out-of-copyright data in their collections accessible would be a great start.
I also have a favourite English language PhD thesis I ask every new model about that they still struggle to find even though there's a Wikipedia article about it that links a blog post I wrote about it.
Anyone who thinks they've exhausted even publicly crawlable resources should ask them about some obscure stuff.
I am not overly confident that Marius Husnes knows what he’s talking about here.
There are a lot of efforts to create LLMs for dying languages and others that use cross cultural models to boost, but if your language is well literate, there’s a good reason to build a heritage LLM specific to your language and culture. Expecting OpenAI or Anthropic to prioritize your language over their target audience when a tradeoff is to be made is absurd.
English performs the best because there is more data in English and high quality sources are either only in English or there is a good translation in English.
https://www.google.com/search?q=tokenizer+efficiency+by+language
Just as we cannot rely on Netflix and HBO to produce Scandinavian TV-shows even though they might do at the moment, we need to make our own stuff in this area too.
And over time, the technology to do this will become cheap and readily available for us to do so.
But then the English models will be even better and you'll be back to square one. My guess is that things are going to become more and more American. If you assume that "culture" is a resource like "microchips", then from an economic point of view it makes sense to have one country specialize in producing it, and the rest just consume. This is why when you turn on the main radio station of a random country, you're so likely to hit American music.
And, for exactly the same reasons as Europeans need to have sovereign compute to protect against economic imperialism, it is also essential to maintain local culture in order to avoid the great replacement of everything with Americanisms.
Yes, it requires pushing against the economics. But you have to do that if you believe that culture has any value per se at all.
I do not. American culture exports American values, which are not universal. Simplest examples being the attitudes towards violence and nudity, which are very different in Europe, and vary within Europe as well.
It seems like you've made an assertion but not provided evidence. Why is it not a disadvantage to only have english LLMs?
Can you get the nuance of Norwegian history/culture with present models?
This potion is potent and you'd think it would stop working from frequent misuse but you'd be wrong!
https://www.bangor.ac.uk/news/2025-09-15-reaching-across-the...
I'm afraid the answer is, mostly you don't.
Such a thing requires strong political will that, at least in my environment, seems basically impossible to align.
The costs are prohibitive, but beyond that, the type of person who cares about local representation like that is either completely fine with letting foreign companies implement it (after all, you can use ChatGPT in Basque if you want to) or is against the idea of AI altogether.
That being said though, I can feel you cringing through the screen.
Then I failed to express myself in writing. I'm definitely a fan of this kind of initiative and am not happy with the type of viability I think they have.
I might very well be projecting a whole lot of local dynamics of national identity, politics and culture though.
see sarvam.ai and their tokenisation improvements on local languages [1]. not every llm needs to help with coding, nor does it need to already become Babel fish.
language is culture, so i can see the motivation behind their initiative. it must be nice to afford to do this yourself.
>see sarvam.ai and their tokenisation improvements on local languages
You don't need to build from scratch to improve tokenization, though.
Russia's T-Bank was able to increase generation speeds by 1.5-3x by changing a stock Qwen's tokenizer to include 5 times more Cyrillic tokens (+ post-training on a Russian corpus).
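To make the tokenizer point concrete, here is a toy sketch in Python. It is not Qwen's or T-Bank's actual tokenizer: the greedy longest-match `tokenize` function and both vocabularies are made up for illustration, and I'm assuming the speedup comes mainly from needing fewer tokens (and so fewer decoding steps) for the same text.

```python
# Toy greedy longest-match tokenizer; NOT Qwen's or T-Bank's real tokenizer.
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text into the longest matching vocab entries, else single characters."""
    max_len = max((len(tok) for tok in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches as fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: one Latin-heavy, one extended with Cyrillic pieces.
base_vocab = {"the", "ing", "tion", "hello"}
extended_vocab = base_vocab | {"при", "вет", "мир"}

text = "привет мир"
base_tokens = tokenize(text, base_vocab)
extended_tokens = tokenize(text, extended_vocab)
print(len(base_tokens), len(extended_tokens))
```

With the Latin-heavy vocabulary every Cyrillic character becomes its own token (10 tokens), while the extended vocabulary covers the same text in 4. A real subword tokenizer is more involved, and the new tokens still need post-training before the model can use them.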
the great thing about the current momentum is that someone can test this hypothesis by applying the T-Bank approach to the same set of languages and compare outcomes.
unfortunately not everyone has the same level of respectable compute this easily available. at least those outside of the ZIRP/VC ecosystem of the valley.
You can put 6PB (244TB * 24) into a single box these days.
If you go with HDD arrays probably $50k
I don’t know this is true. But whatever sounds true enough and gets funding seems to be what flies these days.
I am reminded that we recently concluded that our experiment of forcing things to be digital in school was a flop. These things have a cost if we are wrong.
That said, they are quite limited in what they are allowed to share of in-copyright works, and nb.no is a fantastic resource as it is (though you'll need a Norwegian IP address for too much of it - it's one of the main reasons I maintain a VPN) - if they are allowed to make it accessible there, it'd be great.
But they also have vast amounts of out-of-copyright data that I hope they'd make more easily accessible...
LLMs are not great at preserving cultural uniqueness and diversity. Take how “delve” has reentered the lexicon because the dialect of English of the human assessors involved in training uses “delve” a lot.
There are a lot of benefits to training specifically for a unique culture with unique norms to preserve the culture as we increasingly rely on LLMs.
https://www.scientificamerican.com/article/chatgpt-is-changi...
Both Claude and ChatGPT can translate into minor dialects of Norwegian they will have seen very few works in because very few printed works exist in them.
E.g. I've tested both my local spoken dialect, which is rarely written, and a sociolect used by a 1970's Maoist group consisting of a few hundred people, where most of the printed material consists of novels from a couple of ex-members that became authors.
In the latter case, it claimed to not know, but was able to get a good match from just a description.
I also just had it ape Norwegian orthography from the 1910's by having it look up the rules and translate a text it had first translated from English to modern Norwegian, and it did just fine.
They will have seen some work in these dialects, but mostly it transfers really well from knowing related languages (English, Dutch, German, Swedish, Danish, roughly form a continuum from least in common to most in common with modern Norwegian; they all share vocabulary and significant parts of grammar with Norwegian), and then a relatively limited exposure to Norwegian itself is sufficient to do fairly well.
They're also really good at "style transfer" of text in the form of tweaking orthography, word order, and minor grammar changes from descriptions and examples.
(incidentally, the latter is one way of getting an LLM to sound a lot less like an LLM)
To do translation well you still need cultural knowledge. (E.g. the particular modes of specific kinds of legalese, or slang and the nuances of social class, etc)
Seems like they should be building an MCP service rather than training an entirely new LLM...
At least in my country, Chinese companies have been barred from official tenders and procurement.
They do also crawl websites (or at least did) in the .no tld.
Dell just launched a 2U that fits almost 10 petabytes in it. It's probably not 384 core capable but that is very doable right now, Epyc chips are 192 cores each! https://www.techradar.com/pro/dell-launches-record-shatterin...
More seriously there is a sensibility limit on extreme density where it's not needed. The idea that you're just going to magically get 2 TBit/s out of those ports seems unlikely even with tweaked software, and you're stuck with a power and comms hotspot that's liable to dictate the remainder of your network design.
At max utilisation that 2U would take 12 hours to drain, and only 12 hours assuming peak and likely unachievable throughput and the box otherwise being completely out of service. Not a great start
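For anyone who wants to check the drain-time estimate, here is a rough sketch of the arithmetic in Python. It uses the round numbers from this thread (about 10 PB, 2 Tbit/s aggregate) and assumes decimal petabytes with no protocol overhead, so treat it as a back-of-the-envelope figure only:

```python
# Back-of-the-envelope drain time, using the rough figures from this thread.
CAPACITY_PB = 10       # "almost 10 petabytes" (decimal petabytes assumed)
LINK_TBIT_PER_S = 2    # aggregate throughput assumed to be the full 2 Tbit/s

capacity_bits = CAPACITY_PB * 10**15 * 8
throughput_bits_per_s = LINK_TBIT_PER_S * 10**12

drain_seconds = capacity_bits / throughput_bits_per_s
drain_hours = drain_seconds / 3600
print(f"{drain_hours:.1f} hours")
```

That comes out at roughly 11 hours, which is in line with the "12 hours" figure above once you allow for anything less than perfect throughput.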
It's still a weird article, to highlight a "big" storage appliance. Having all that NVMe local feels like it would be much much much much faster.
Western society is completely captured by this form of psychosis and it's going to bite us in the a* very soon.
I firmly believe all the Boomer leaders throughout the world are being sold a bag of lies by technocrats that "AI", specifically LLMs, are going to cure disease and death, and therefore they are willing to hand over all control to the technocrats. Fckin croakers at it again.