Hacker News Clone

TrackerFFMay 25, 2026, 8:44 PM

I'm a Norwegian, and I use the national library almost every day for searching through texts. They have truly one of the best working user interfaces (and functionality) for searching through the massive amounts of text.

vidarhMay 25, 2026, 10:04 PM

It's really fantastic. I just wish there were fewer restrictions on the content that is accessible.

(a lot is only accessible from Norwegian IP addresses, so it's one of the main reasons I maintain a VPN as I'm Norwegian but live in the UK; a second set is only available from the IP addresses of libraries or research institutions - still huge amounts that are generally available, though)

TrackerFFMay 26, 2026, 7:57 AM

My biggest gripe with it is the restrictions, indeed.

When searching through the closed newspapers, you have to apply for access manually, which gives you 8 hours of access. Great. Only that the access is seemingly manually granted - so if you apply 16:05 on a Friday, chances are you won't get any access until 9-10 the next Monday.

With that said, I do understand why it is like that. If people could apply via API, and get instant access, they would probably just stop buying newspaper subscriptions.

vidarhMay 27, 2026, 7:54 AM

I actually didn't realise you could apply. I always just went back and ignored the closed ones without reading closely enough apparently. Thanks for making me aware - there are a few that are relevant to me for genealogy reasons that I've not looked at because of this.

mettamageMay 26, 2026, 6:17 AM

Silly question but can a non-Norwegian also access it? Willing to pick up some Norwegian along the way ;-)

vidarhMay 26, 2026, 10:17 AM

You can access quite a bit directly. Check out nb.no (or https://www.nb.no/en/ for an English version of the page, but of course most of the works are in Norwegian)

There is an escalating series of restrictions, basically:

* Available for everyone.

* Available from a Norwegian IP -> just requires a VPN.

* Available from Norwegian libraries

* Available under "special conditions". This would mean from a participating research institution or university, or similar.

Pretty much everything that is out of copyright falls in the first category. The second and third categories have a bunch of copyrighted material where the copyright holders have granted limited usage rights. A bunch of newspaper archive material that is still under copyright (but sadly not the biggest ones) is available from Norwegian IPs for example.

TelaneoMay 26, 2026, 6:40 AM

If you have access to a Norwegian IP, then yes.

throwaway85825May 25, 2026, 11:39 PM

The lack of a universal search engine is very frustrating. Why can't I search within TV subtitles?

vintermannMay 26, 2026, 5:22 AM

Well... You realize how used you are to the basic stemming and spelling flexibility which every search engine has had since Altavista.

KeplerBoyMay 25, 2026, 10:11 PM

How true is this statement: "He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language."

I thought all big players already train on basically everything remotely available to them no matter the language or quality, so his take sounds like an opinion formed in the early days of generally available LLMs.

amarantMay 26, 2026, 12:11 AM

Not remotely true in my estimation. I don't really speak Norwegian, but I do speak Swedish (which means I mostly understand Norwegian as they're very similar). Every model I've tried speaking Swedish to does it perfectly. I'd be surprised if the same isn't true for Norwegian already.

schubidubidubaMay 26, 2026, 6:32 AM

Of course they speak Swedish. But often, they do not reason in Swedish and do not search in Swedish. Swedish makes up a tiny fraction of training data, while the vast majority is English, from the US. Which means the answers will always have a bias towards US culture, even if you ask in Swedish and the LLM answers in Swedish.

NorwegianDudeMay 26, 2026, 7:23 AM

While Google does a good job with language support in their models, GPT-5.5 can't write proper Norwegian. It's even making up words that do not exist.

vintermannMay 26, 2026, 5:18 AM

Does that include local distilled models? Because it didn't last time I checked for Norwegian.

varjagMay 26, 2026, 9:08 AM

Not really. For instance Facebook speech recognition models had Swedish support but no Norwegian.

mistrial9May 26, 2026, 2:45 AM

different models have been very different in this way.. almost ten years ago the French made a very large effort to capture languages.. the release notes I read at the time IIRC had quite a few languages from South Asia / India, and in Africa. The language that was prominently missing was German IIRC. I cannot say for the 2025-2026 models since so much has happened.. but models are not equal.

vintermannMay 26, 2026, 4:15 AM

Foreign LLMs are probably not trained on the Norwegian National Library. I regularly find things in there (with regular keyword search, for genealogy) which neither search engines nor language models know.

Of course I then usually put the information I'm interested in somewhere AI could scrape it. But it would take a long, long time to get everything interesting out of there.

intronicMay 26, 2026, 4:27 AM

Yep in the article it says ..the National Library .. has the single largest digital collection of Norwegian books, newspapers, web pages .. it is entitled to receive copies of every published book and broadcasted content. Its legal deposit mandate in this area extended beyond books, as it was duty-bound to collect and preserve all of Norway’s cultural heritage .. an agreement with Norwegian newspapers permitted LLM training on copyrighted content.

Husnes said: ”No private company has this.”

So yeah they seem to have proprietary data...

pastageMay 26, 2026, 6:49 AM

> proprietary data

It is just copyrighted data, that is harder to get a hold of. All the copies are available to anyone to use if they just read it. Copyright makes other uses complicated. I wonder if the whole Creative commons debate was a mistake, you can never fix copyright in a digital world.

internet_pointsMay 26, 2026, 9:22 AM

Maybe it can at least write like a Norwegian instead of just English-translated-into-Norwegian. It would be interesting to see if they try something like the experiments in https://arxiv.org/pdf/2507.22445 on it.

orbital-decayMay 25, 2026, 11:55 PM

Current-best models are pretty fluent at major languages and cultures, so it's untrue at least for the "any" qualifier. Performance is barely affected or might be even better sometimes. However English patterns can subtly leak into native patterns of other languages. It's obviously very different for low-resource languages, but to improve them you need more data, not a new model.

Barrin92May 26, 2026, 12:42 AM

>Current-best models are pretty fluent at major languages and cultures

strong disagree on that one. As a German interacting with ChatGPT, even in German it gives me the feeling of talking to the Pluribus people, which reminds me of an anecdote of Walmart failing in Germany because people were freaked out by the constantly upbeat, smiling employees.

Understanding a culture is a very different task than translating the syntax of a text, and these systems might be capable of syntactic fluency but they do not really understand culture. You have to metaphorically abuse these models until they stop sounding like the crossover of a HR department person and a Mormon missionary

varjagMay 26, 2026, 9:14 AM

Set the personality to 'Robot', it makes the interactions so much more tolerable.

bblbMay 26, 2026, 3:59 AM

I'm Finnish and dear god I hate the default overtly friendly tones of LLMs. Always the first thing to tune in system prompt.

You're a machine, stop anthropomorphizing yourself and pretending to be my best friend, and just give me the damn answer and nothing else. :D

ampersandwhichMay 26, 2026, 7:04 AM

I fully agree. I'm Swedish and have recently used GPT to help me draft some cover letters in Swedish. Even with all the mandatory personality tweaks and prompting, it always seems to default to highly florid and self-congratulatory Americanisms if I'm not careful. It's very subtle.

I do understand where proponents of language equivalency are coming from. LLMs seem to be extremely good at answering simple, one-shot type questions and mechanical 'low-level' translations for most languages. I feel like as soon as you introduce complex chains of thought or multi-step cross-linguistic tasks, minor imperfections stack and become magnified, just as with coding tasks or context rot.

ameliusMay 26, 2026, 8:25 AM

It's probably just an excuse to play with LLMs using big government funding :)

alliaoMay 26, 2026, 12:39 AM

yeah and alignment is all about how to be less evil which is no easy job... I can just imagine a Chinese LLM rendering 1989 Tiananmen Square as an incident orchestrated by the CIA which the CCP successfully thwarted etc etc

intendedMay 26, 2026, 4:58 AM

Quite true?

English is ludicrously over abundant in training when compared to any language.

KeplerBoyMay 26, 2026, 6:11 AM

And that's probably necessary if you want a competent model. There simply isn't much Norwegian literature on, let's say, banana farming.

DiogenesKynikosMay 26, 2026, 4:01 AM

As the article explains, Norway's National Library has a database of practically everything published and broadcast in Norwegian going back many decades. From the way the dataset is described in the article, it does not sound like OpenAI et al. would have easy access to it in its entirety.

solenoid0937May 25, 2026, 8:39 PM

> The Olivia system is an HPE Cray Supercomputing EX system, with 448 GPUs and 64,512 CPU cores.

Training a sovereign LLM with this meager hardware as opposed to a LORA on some open source model seems like a huge mistake and a potential red flag.

There is no way these people have the resources to train a fully fledged LLM, so claiming that is their goal makes me think they don't intend for the LLM to be useful.

Which begs the question, whose money are they wasting - and why?
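
For a rough sense of the gap between full training and a LoRA, here is a back-of-the-envelope sketch of the trainable parameter counts. It assumes the standard LoRA formulation (a frozen d×k weight matrix plus a low-rank update B·A of rank r); the 4096×4096 layer size and rank 8 are illustrative assumptions, not figures from the article.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r update B (d x r) times A (r x k)."""
    return r * (d + k)


# Illustrative sizes only (assumed, not taken from the article).
d, k, r = 4096, 4096, 8
full = full_params(d, k)
lora = lora_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions the adapter trains well under 1% of the layer's weights, which is why a LoRA fits on far smaller hardware than training from scratch.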

timmgMay 25, 2026, 9:14 PM

I wonder if instead (or in parallel), Norway should build a set of training data and share it (for free) with all the model builders.

Seems like making the frontier models know Norwegian and their culture is a better (or additional!) way to reach the end they are going for here.

vidarhMay 25, 2026, 9:45 PM

The frontier models know Norwegian just fine. They can also adapt to Norwegian dialects, and even ape old Norwegian fairly well.

E.g. I had Claude describe the novel "De knyttede næver" from 1911 in Norwegian orthography ca. 1911, as it's a novel I've read, and it does a good job.

What it lacks is an understanding of Norwegian literature, culture and history. It had to look up "De knyttede næver", which was one of the best-selling Norwegian novels around the time it was published before I'd get anything out of it (ChatGPT does better; in thinking mode in particular it gives a detailed summary).

While not exactly well known today, the author was a prominent newspaper journalist for decades, and the novel series is well enough known that e.g. there's a Norwegian singer that took his stage name after the protagonist, and it was covered in Norwegian papers and books for decades (partly because of controversy over the author's political views and how they coloured his novels), so it does feel like a reasonable test that reveals a quite significant knowledge gap.

I do agree with you that it'd be better if the data set from the national library was made more accessible, though it seems a major addition here is that they have a deal to train on copyrighted data locked away in their archives that they have limitations on the use of.

But even just making the out-of-copyright data in their collections available would be a great start.

e12eMay 25, 2026, 10:51 PM

Odd, I'd imagine Wikisource (in many/all languages) would be part of training data for all LLMs with SOTA ambition?

https://no.wikisource.org/wiki/De_knyttede_n%C3%A6ver

vidarhMay 25, 2026, 11:24 PM

You'd think so. It seems like there are a lot of odd gaps like that.

I also have a favourite English language PhD thesis I ask every new model about that they still struggle to find even though there's a Wikipedia article about it that links a blog post I wrote about it.

Anyone who thinks they've exhausted even publicly crawlable resources should ask them about some obscure stuff.

thatcatMay 26, 2026, 3:15 AM

the models don't retain their full training data set

vidarhMay 26, 2026, 10:18 AM

No, but they do retain enough that it is interesting what they fail to retain.

mistrial9May 26, 2026, 2:57 AM

you might be surprised if you take this approach.. give key words and phrases in small amounts, each sentence of a prompt building on a previous sentence. Take an example that is not very hard, like the original text of Lewis Carroll's Alice in Wonderland. Although a quick question might get things sort of wrong, or miss details, if you guide the LLM to a certain part of the story, then a certain set of characters in that part of the story, then a certain statement or dramatic moment with those characters in that part of the story, you might get very specific detail that is close to line-by-line accurate. On the other hand, if you ask a quick, ordinary question about the same part of the story without supplying context and character names, you get something equally vague. YMMV

vidarhMay 26, 2026, 10:23 AM

For the PhD thesis in question, I've actually tested a lot of requests about different parts of it, and both Claude and ChatGPT still draw a total blank if you don't let them do searches.

calgooMay 26, 2026, 8:05 AM

Why should they share all this data with the greedy american corporations that are stealing everyones data for their own profit? Much better to keep the legal agreement with the national institutions and possibly develop something actual useful to their own country.

konschubertMay 26, 2026, 9:06 AM

You are contradicting yourself. If you're hoarding the data for yourself you're not going to develop something useful. Sharing the data means that it will be integrated into the big LLMs, which will be useful "for their own country".

raframMay 26, 2026, 2:54 AM

> Marius Husnes, the Head of IT Platform at the library (Nasjonalbiblioteket) discussed the project at Huawei’s ID Forum 2026 in Paris, saying that no commercial LLM provider was developing a local (Norwegian) language LLM. He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language.

I am not overly confident that Marius Husnes knows what he’s talking about here.

fnordpigletMay 26, 2026, 4:30 AM

He’s right though, although it’s not entirely about the training corpus. It’s about the tokenizer that tokenizes substrings more efficiently based on a necessary bias towards a target language. English-oriented LLMs are more powerful for English than other languages because the token space is more parsimonious in the English language. Try any online Anthropic tokenizer that calls their API with common English words (typically one or fewer tokens) and Norwegian words - you’ll often see 2-4 tokens instead, sometimes more. Some languages like Thai are at a huge disadvantage. Likewise the corpus selection is often heavily skewed towards the target language simply because more energy is applied to sourcing written works in that language. There will also be semantic biases in the vector space due to cross influence between semantically similar embeddings between languages that create a baseline different from the cultural one. Finally fine tuning greatly impacts cultural expression in the LLM. None of these are trivial effects.

There are a lot of efforts to create LLMs for dying languages and others that use cross cultural models to boost, but if your language is well literate, there’s a good reason to build a heritage LLM specific to your language and culture. Expecting OpenAI or Anthropic to prioritize your language over their target audience when a tradeoff is to be made is absurd.
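
One piece of the tokenizer argument can be checked without calling any API. Byte-level BPE tokenizers start from UTF-8 bytes, so the byte count is the worst case for how many tokens a string costs when no merges apply. A minimal sketch, assuming that byte-level fallback; the sample strings are my own picks, and real tokenizers merge many of these bytes, so treat the numbers as an upper bound rather than actual token counts.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character (code point) in text."""
    return len(text.encode("utf-8")) / len(text)


# Sample strings chosen for illustration only.
samples = {
    "English": "clenched fists",
    "Norwegian": "knyttede næver",
    "Thai": "ภาษาไทย",
}

for language, text in samples.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{language}: {n_chars} chars, {n_bytes} bytes, "
          f"{utf8_bytes_per_char(text):.2f} bytes/char")
```

Norwegian only pays extra for letters like æ, ø and å (two bytes each), while every Thai character takes three bytes, which is one reason scripts like Thai start from a worse position before any merges are learned.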

YetAnotherNickMay 26, 2026, 6:13 AM

Did you even try to verify your claims? I tested it on a few translations of Wikipedia articles using [1] and it takes 15-20% more tokens for Norwegian.

English performs the best because there is more data in English and high quality sources are either only in English or there is a good translation in English.

[1]: https://platform.openai.com/tokenizer

tecleandorMay 26, 2026, 11:16 AM

In tests I've done with NO and FI texts, for the same number of characters, I get around 2x the tokens of EN with the GPT5 tokenizer. With the older tokenizers it's more like 2x or even 3x.

numpad0May 26, 2026, 11:44 AM

Tokenizer efficiency varying by languages, by as much as up to 15x, is very well known and established

  https://www.google.com/search?q=tokenizer+efficiency+by+language

chvidMay 26, 2026, 4:12 AM

When I am chatting with ChatGPT - it is fairly obvious that it is American - its native language, its style, its attitude is American - even if we chat in Danish.

Just as we cannot rely on Netflix and HBO to produce Scandinavian TV-shows even though they might do at the moment, we need to make our own stuff in this area too.

And over time, the technology to do this will become cheap and readily available for us to do so.

amunozoMay 26, 2026, 1:17 PM

I chat to it in English instead of my native Spanish not only because of performance, but because I cannot stand the unnatural style it has in Spanish.

anal_reactorMay 26, 2026, 6:37 AM

> And over time, the technology to do this will become cheap and readily available for us to do so.

But then the English models will be even better and you'll be back to square one. My guess is that things are going to become more and more American. If you assume that "culture" is a resource like "microchips", then from economic point of view it makes sense to have one country specialize in producing it, and the rest just consume. This is why when you turn on the main radio station of a random country, you're so likely to hit American music.

ikr678May 26, 2026, 6:54 AM

'Only one country should export culture, for economic efficiency' is the kind of take that the Norwegians (and everyone else) would like to protect themselves from.

pjc50May 26, 2026, 10:09 AM

> then from economic point of view it makes sense to have one country specialize in producing it, and the rest just consume

And, for exactly the same reasons as Europeans need to have sovereign compute to protect against economic imperialism, it is also essential to maintain local culture in order to avoid the great replacement of everything with Americanisms.

Yes, it requires pushing against the economics. But you have to do that if you believe that culture has any value per se at all.

wasmitnetzenMay 26, 2026, 12:27 PM

> If you assume that "culture" is a resource like "microchips"

I do not. American culture exports American values, which are not universal. Simplest examples being the attitudes towards violence and nudity, which are very different in Europe, and vary within Europe as well.

anal_reactorMay 26, 2026, 1:47 PM

Which is already changing thanks to the American influence.

isawczukMay 26, 2026, 5:24 AM

Poland has its own LLM called Bielik. It's not only better at preserving Polish-sounding wording, it's also better at writing government documents. Why better? They did an arena and statistically it's just better.

KaiserProMay 26, 2026, 7:10 AM

could you provide evidence to suggest he is wrong?

It seems like you've made an assertion but not provided evidence. Why is it not a disadvantage to only have english LLMs?

Can you get the nuance of Norwegian history/culture with present models?

spiderfarmerMay 26, 2026, 3:10 AM

It sounds plausible enough to get subsidies.

maxlohMay 26, 2026, 6:06 AM

[dead]

idiotsecantMay 26, 2026, 3:36 AM

You're making the mistake of thinking whether he knows what he is talking about matters. He is brewing a potion. Its ingredients are a trendy term, a vaguely spooky threat and a clear, overly simplistic solution that of course he will graciously assume control of, for the good of the motherland.

This potion is potent and you'd think it would stop working from frequent misuse but you'd be wrong!

vintermannMay 26, 2026, 4:11 AM

He won't have control over it.

seanvkMay 26, 2026, 5:28 AM

The Welsh language getting LLM training with Nemotron

https://www.bangor.ac.uk/news/2025-09-15-reaching-across-the...

LevitzMay 25, 2026, 8:37 PM

>As Husnes put it; Norway is a small country solving a problem every non-English-speaking nation will face: how do you build AI that reflects your language, your culture and your history? AI needs custodians, not just builders.

I'm afraid the answer is, mostly you don't.

Such a thing requires strong political will that, at least in my environment, seems basically impossible to align.

The costs are prohibitive, but beyond that, the type of person who cares about local representation like that is either completely fine with letting foreign companies implement it (after all, you can use ChatGPT in Basque if you want to) or is against the idea of AI altogether.

ttkariMay 25, 2026, 10:07 PM

I guess it's subject to debate whether the cost indeed is prohibitive in the case of Norway. They are a small but extremely wealthy country - after all, they currently hold the equivalent of 1,5% of all the listed companies globally through the investments of their sovereign wealth fund.

WarmWashMay 25, 2026, 9:15 PM

I'm sure if Norway approached the American labs with the goal of making curated datasets for training, they would absolutely get in the training door, and those models would likely run circles around anything that could be domestically done.

That being said though, I can feel you cringing through the screen.

LevitzMay 25, 2026, 11:36 PM

>That being said though, I can feel you cringing through the screen.

Then I failed to express myself in writing. I'm definitely a fan of this kind of initiative and am not happy with the type of viability I think they have.

I might very well be projecting a whole lot of local dynamics of national identity, politics and culture though.

yokoprimeMay 26, 2026, 6:06 AM

The wording in this article is a bit strange, why the extreme focus on the brand of storage media? Also, the term LLM seems to be used in a very broad way here, are they actually building a language model from scratch, or are they fine-tuning?

rldjbpinMay 26, 2026, 9:26 AM

may not be the most efficient way to go about things, but there remains a seemingly obvious use case for non-latin languages to do things from scratch.

see sarvam.ai and their tokenisation improvements on local languages [1]. not every llm needs to help with coding, nor does it need to become a Babel fish already.

language is culture, so i can see the motivation behind their initiative. it must be nice to afford to do this yourself.

[1] https://www.sarvam.ai/blogs/sarvam-30b-105b

kgeistMay 26, 2026, 9:55 AM

>but there remains a seemingly obvious use case for non-latin languages to do things from scratch

>see sarvam.ai and their tokenisation improvements on local languages

You don't need to build from scratch to improve tokenization, though.

Russia's T-Bank was able to increase generation speeds by 1.5-3x by changing a stock Qwen's tokenizer to include 5 times more Cyrillic tokens (+ post-training on a Russian corpus).
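
To illustrate why adding vocabulary entries cuts token counts (and with them the number of generation steps), here is a toy sketch. It uses a greedy longest-match tokenizer with a single-character fallback, which is a simplification assumed for illustration and not how Qwen's tokenizer or T-Bank's pipeline actually works; the vocabularies and the sample word are made up.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenizer; unknown text falls back to single characters."""
    max_len = max((len(token) for token in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: an English-only one, then one extended with Cyrillic pieces.
base_vocab = {"the", "library", " "}
extended_vocab = base_vocab | {"биб", "лио", "тека"}

word = "библиотека"  # Russian for "library"
before = tokenize(word, base_vocab)
after = tokenize(word, extended_vocab)
print(len(before), "tokens before,", len(after), "tokens after:", after)
```

The new entries still need embeddings the model can use, which is where the post-training on a Russian corpus mentioned above comes in.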

rldjbpinMay 26, 2026, 11:45 AM

the improvements for sarvam were with the amount of tokens used to represent words in english vs non-english languages.

the great thing about the current momentum is that someone can test this hypothesis by applying the T-Bank approach to the same set of languages and compare outcomes.

unfortunately not everyone has the same level of respectable compute this easily available. at least those outside of the ZIRP/VC ecosystem of the valley.

danborn26May 26, 2026, 9:28 AM

This is a massive storage deployment. Given the I/O demands of LLM training, especially for checkpointing, moving to this scale of NVMe flash makes sense compared to traditional disk arrays.

dmos62May 26, 2026, 8:01 AM

Huawei? You'd think that the recent European revulsion from using overseas providers would have reached Norway's public sector too.

dalemhurleyMay 25, 2026, 8:44 PM

How about that, they actually asked for permission to use data and the companies said yes.

vintermannMay 26, 2026, 5:44 AM

I think you're required by law to let the National Library have copies of books/newspapers you publish beyond a certain scale.

postepowanieadmMay 26, 2026, 6:51 AM

Norway isn't in the EU (no restrictions on Huawei) and has cheap electricity, could become an ai powerhouse.

layer8May 26, 2026, 6:14 PM

Norway is in the EEA, which has adopted the existing EU Cybersecurity Act.

postepowanieadmMay 26, 2026, 6:29 PM

Thank you, I didn't know that.

arjieMay 25, 2026, 8:53 PM

This can’t be right. 2 PB of flash is like $200k. It’s within reach of many individuals. Then again I guess you don’t need that much storage so maybe it is.

tjwebbnorfolkMay 26, 2026, 12:20 AM

Also my first thought: "Is that... a lot?"

You can put 6PB (244TB * 24) into a single box these days.

devttyeuMay 25, 2026, 8:58 PM

More like $1M at current prices at this scale / level of performance.

If you go with HDD arrays probably $50k
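
For scale, the per-gigabyte prices implied by the figures in this subthread work out as follows. A quick sketch, assuming decimal units (1 PB = 1,000,000 GB) and taking the $200k, $1M and $50k totals from the comments at face value rather than as real quotes.

```python
GB_PER_PB = 1_000_000  # decimal units assumed


def usd_per_gb(total_usd: float, petabytes: float) -> float:
    """Price per gigabyte for a given total cost and capacity in petabytes."""
    return total_usd / (petabytes * GB_PER_PB)


# Totals quoted in the comments above, all for 2 PB.
estimates = {
    "flash, first guess": 200_000,
    "flash, current prices": 1_000_000,
    "HDD arrays": 50_000,
}

for label, total in estimates.items():
    print(f"{label}: ${usd_per_gb(total, 2):.3f} per GB")
```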

arjieMay 25, 2026, 9:20 PM

Boy pricing is pretty nuts these days. I have half a petabyte in Seagate enterprise drives myself and I didn’t pay anything close to that to acquire it. Such a pity about the flash storage. 2 years ago we built 200 TiB or something of flash using Samsung PM1633 or something and it was a fraction of the cost per gigabyte that $1m would imply.

rcbdevMay 26, 2026, 6:12 AM

We're in the boom phase of the cycle. The bust on these chips always comes.

metadatMay 25, 2026, 9:36 PM

Your numbers are a little off but the point remains- 2PB is nothing, not newsworthy imo. What’s special about this?

vidarhMay 25, 2026, 9:55 PM

What's special about it is not the flash but training an LLM based on the content, much of which is still in copyright and which the library has restrictions on how they are allowed to use (irrespective of the legal position of training on it) and which required an agreement with the copyright holders.

Den_VRMay 25, 2026, 8:31 PM

> He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language.

I don’t know this is true. But whatever sounds true enough and gets funding seems to be what flies these days.

redanddeadMay 25, 2026, 8:38 PM

They made the cultural case, you have no idea how strong this is in places like quebec, nordics, france, russia etc

sgtMay 25, 2026, 8:47 PM

Can confirm that. Norway may have a small population, but if you live there you'll think it's truly the center of the world (aside from the US. Norwegians love America)

keyboredMay 26, 2026, 9:37 PM

Bait should strive to be a bit more subtle.

elygreMay 26, 2026, 4:12 AM

Love America? Yes, we did.

Epa095May 26, 2026, 5:15 AM

It is, after all, a God which turned out not to be a God, but just America.

petterroeaMay 26, 2026, 6:34 AM

As a Norwegian I have never needed a Norwegian language model. Doing most things in Norwegian puts you at a disadvantage internationally anyways. Maybe this has value in schools, but wouldn't it just give kids more trust in relying on LLMs? My friends who work in education report that group work has become insufferable because many do not think critically and ask an LLM to verify everything. I really don't see a benefit, but maybe they will find one - that is what research is for.

I am reminded that we recently concluded that our experiment of forcing things to be digital in schools was a flop. These things have a cost if we are wrong.

kvamMay 25, 2026, 8:40 PM

As a Norwegian this sounds like a mistake. Who will use this LLM? Where? For what? The underlying data could be made more easily searchable and digestible for agents in general if the goal is better knowledge of Norwegian culture.

SchlagbohrerMay 26, 2026, 7:48 AM

Sapir-Whorf hypothesis but for AI

teknologistMay 26, 2026, 7:11 PM

Modern frontier LLMs know Norwegian. Wouldn't this simply be the job of RAG to do lookups as the user requests for data? Like how Gemini integrates with Google Search.

Seems like they should be building an MCP service rather than training an entirely new LLM...
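
The lookup-instead-of-train idea can be sketched in a few lines. This is a minimal toy, assuming naive word-overlap scoring in place of a real search index or embedding model; the `retrieve` and `build_prompt` helpers and the two-document corpus are hypothetical, and nothing here reflects the library's actual API or an MCP server.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased set of word tokens in text."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, corpus: dict[str, str], k: int = 1) -> list[str]:
    """Return the ids of the k documents sharing the most words with the query."""
    query_words = words(query)
    ranked = sorted(
        corpus,
        key=lambda doc_id: len(query_words & words(corpus[doc_id])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, corpus: dict[str, str], k: int = 1) -> str:
    """Prepend the retrieved documents to the question before it goes to a model."""
    context = "\n".join(corpus[doc_id] for doc_id in retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical two-document corpus built from facts mentioned in this thread.
corpus = {
    "novel": "De knyttede næver is a Norwegian novel from 1911.",
    "storage": "The library uses 2 PB of flash storage in its data pipeline.",
}

print(build_prompt("Which novel is from 1911?", corpus))
```

The trade-off discussed elsewhere in the thread still applies: retrieval can supply facts at query time, but it does not change the style or cultural defaults the underlying model was trained with.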

6510May 26, 2026, 5:41 AM

What is called culture here will increasingly be propaganda. It reminds me of people cheering twitter as a replacement of RSS or using facebook to communicate with your customers rather than email. You won't know which will be the winning company, don't know who might control it in the future and we can't predict what it will cost. It doesn't take much to be very annoying.

ipsum2May 25, 2026, 8:33 PM

This is how much storage the average r/datahoarder user has in their basement. Fewer than 100 hard drives.

arjieMay 25, 2026, 8:55 PM

But not in flash. I have an appreciable fraction of that but in spinning rust.

DeathArrowMay 26, 2026, 6:07 AM

I thought US has already coerced most countries to not buy hardware from Huawei.

At least in my country, Chinese companies have been barred from official tenders and procurement.

kreyenborgiMay 25, 2026, 8:38 PM

Ad for Huawei?

dzhiurgisMay 25, 2026, 9:50 PM

That's about 350MB per capita. Humans can produce 2-6kb per hour. That's 13 years of non-stop typing. Wonder where it all comes from. I guess it's websites that aren't compressed / extracted.
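
The arithmetic roughly checks out. A quick sketch, assuming 2 PB means 2×10^15 bytes and a population of about 5.6 million (my assumption, not a figure from the article); the typing rates are the 2-6 kB per hour quoted above.

```python
TOTAL_BYTES = 2 * 10**15      # 2 PB, decimal units assumed
POPULATION = 5_600_000        # rough population of Norway (assumption)
HOURS_PER_YEAR = 24 * 365


def typing_years(num_bytes: float, kb_per_hour: float) -> float:
    """Years of non-stop typing needed to produce num_bytes at the given rate."""
    hours = num_bytes / (kb_per_hour * 1000)
    return hours / HOURS_PER_YEAR


per_capita = TOTAL_BYTES / POPULATION
print(f"{per_capita / 1e6:.0f} MB per capita")
for rate in (2, 6):
    print(f"{rate} kB/hour: {typing_years(per_capita, rate):.1f} years")
```

Under these assumptions the range is roughly 7 to 20 years of continuous typing, so the 13-year figure corresponds to a rate near the middle of the quoted 2-6 kB per hour.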

vidarhMay 25, 2026, 10:02 PM

It's a legal deposit library, same as e.g. Library of Congress. Which means almost every published book, magazine, and newspaper and many other works published in Norway, as well as large collections of Norwegian works published abroad (such as thousands of Norwegian-language newspapers published by the Norwegian immigrant communities in the US) for many decades and a large proportion of the same from the last 200+ years are stored there.

They do also crawl websites (or at least did) in the .no tld.

jauntywundrkindMay 25, 2026, 8:28 PM

384 core cpu cluster? 2 petabytes?

Dell just launched a 2U that fits almost 10 petabytes in it. It's probably not 384 core capable but that is very doable right now, Epyc chips are 192 cores each! https://www.techradar.com/pro/dell-launches-record-shatterin...

100msMay 25, 2026, 8:56 PM

5x 400gbit running to a 2U box whoa, the PCI lanes must have heat shielding.

More seriously there is a sensibility limit on extreme density where it's not needed. The idea that you're just going to magically get 2 TBit/s out of those ports seems unlikely even with tweaked software, and you're stuck with a power and comms hotspot that's liable to dictate the remainder of your network design.

At max utilisation that 2U would take 12 hours to drain, and only 12 hours assuming peak and likely unachievable throughput and the box otherwise being completely out of service. Not a great start
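
The drain-time estimate is easy to reproduce. A back-of-the-envelope sketch, assuming roughly 10 PB (decimal) in the box and all five 400 Gbit/s ports saturated with no protocol overhead, which is the best case already described above as unachievable.

```python
def drain_hours(capacity_bytes: float, ports: int, gbit_per_port: float) -> float:
    """Hours needed to read out capacity_bytes over the given ports at line rate."""
    total_bits = capacity_bytes * 8
    bits_per_second = ports * gbit_per_port * 1e9
    return total_bits / bits_per_second / 3600


# Assumed capacity of about 10 PB; port count and speed are from the comment above.
hours = drain_hours(10e15, ports=5, gbit_per_port=400)
print(f"{hours:.1f} hours at full line rate")
```

That comes out at about 11 hours, in line with the 12-hour figure once the box is slightly under 10 PB or the links run a bit below line rate.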

abujazarMay 25, 2026, 9:34 PM

That's the in-house preprocessing hardware, not what they're training on.

jauntywundrkindMay 25, 2026, 9:57 PM

Yes!

It's still a weird article, to highlight a "big" storage appliance. Having all that NVMe local feels like it would be much much much much faster.

7eMay 25, 2026, 8:24 PM

2 PB? They will not come close to training on that amount. Maybe years from now.

sgtMay 25, 2026, 8:48 PM

Think they will not train on the full 2 PB but use that as the data lake to start and then apply a more targeted approach.

winddudeMay 25, 2026, 9:15 PM

if you read the article 2pb is available as flash storage in the data pipeline, used to dedupe, clean, normalize, etc, for training from 60pb of raw data.
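
The dedupe and normalize stage can be pictured with a toy example. This is a minimal sketch of exact-duplicate removal after Unicode and whitespace normalization, assuming hash-based matching; the `normalize` and `dedupe` helpers are hypothetical, and the article does not describe the library's actual pipeline in this detail.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Canonicalize Unicode form, collapse whitespace, and fold case."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split()).casefold()


def dedupe(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Made-up sample documents; the second differs only in spacing and case.
docs = ["De knyttede næver", "de  knyttede   NÆVER", "Nasjonalbiblioteket"]
print(dedupe(docs))
```

Real pipelines usually add near-duplicate detection on top of this, since scanned newspapers rarely match byte for byte.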

Den_VRMay 25, 2026, 8:34 PM

Could probably LoRA with that

huflungdungMay 25, 2026, 8:33 PM

[dead]

yanhangyhyMay 26, 2026, 2:22 AM

so now Huawei is not a threat to 'democracy' anymore?

dopa42365May 26, 2026, 6:14 AM

whenever Huawei want to buy billions of dollars worth of US licenses and stuff, they stop being a "national security threat" for a while because reasons

dakolliMay 25, 2026, 11:23 PM

Even entire governments are captured by a mild LLM psychosis. Which is sad in the case of Norway. I lived in Norway for two years and always found their government to be highly rational, this is not a rational use of public funds (but I suppose they have plenty of capital).

Western society is completely captured by this form of psychosis and it's going to bite us in the a* very soon.

I firmly believe all the Boomer leaders throughout the world are being sold a bag of lies by technocrats that "AI", specifically LLMs, are going to cure disease and death and therefore they are willing to hand over all control to the technocrats. Fckin croakers at it again.

NonHyloMorphMay 26, 2026, 9:09 AM

I think it is highly rational. You see it from the wrong point of view. It seems to be less a short utilitarian project or economic endeavour, but a cultural one. Think of it more in terms of applied humanities. Which languages go extinct, which cultures disappear and are superseded by a monocultural globalist hegemony.

snorremdMay 26, 2026, 11:54 AM

Exactly. Nasjonalbiblioteket (National Library of Norway) has centuries of written material (Bokmål, Nynorsk and some Sami) and decades of audio and video material featuring varied dialects from all over the country. I believe training models that encompass this information can help in preserving both our language, history, and culture for future generations that increasingly turn to AI to get their information.

sspoiskMay 26, 2026, 9:35 AM

[flagged]

hottrendsMay 25, 2026, 10:48 PM

[flagged]

huss-moMay 25, 2026, 9:53 PM

[flagged]

hank808May 25, 2026, 9:36 PM

Ehhh. None of this sounds right. Translation problems maybe. Lack of technical detail understanding maybe... I don't know. Probably not news.

Norway's 2 petabytes of Huawei flash storage and LLM training
