Talk:Common Crawl
This article is rated Start-class on Wikipedia's content assessment scale.
The Wikimedia Foundation's Terms of Use require that editors disclose their "employer, client, and affiliation" with respect to any paid contribution; see WP:PAID. For advice about reviewing paid contributions, see WP:COIRESPONSE. Edits made by the below user(s) were last checked for neutrality on 10 May 2026 by Superb Owl.
External links modified
Hello fellow Wikipedians,
I have just modified one external link on Common Crawl. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
- Added archive https://web.archive.org/web/20150404132256/http://blog.commoncrawl.org/ to http://blog.commoncrawl.org/
When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.
This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).
- If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
- If you found an error with any archives or the URLs themselves, you can fix them with this tool.
Cheers.—InternetArchiveBot (Report bug) 10:09, 11 August 2017 (UTC)
Some statistics by CommonCrawl
https://commoncrawl.github.io/cc-crawl-statistics/ Should these percentages be added? — Preceding unsigned comment added by 5.206.101.180 (talk) 18:30, 18 May 2022 (UTC)
There is another important page of statistics
Here it is: https://commoncrawl.github.io/cc-crawl-statistics/plots/charsets.html 5.206.107.6 (talk) 16:18, 19 May 2023 (UTC)
Size discrepancy?
The lead says the data set is several petabytes but the last record in the timeline says it's 386 TB. Which is it? 179.24.10.81 (talk) 14:46, 23 October 2024 (UTC)
- There is no discrepancy: April 2024 crawl is 386 TB, while the sum of all crawls is several PB. MGeog2022 (talk) 12:05, 19 April 2025 (UTC)
- Now over 10 petabytes. These numbers are all published on our website. Greg (talk) 03:49, 15 February 2026 (UTC)
Disagreements about CCF
So I'm a very early contributor to Wikipedia, but I'm a bit rusty with my skills: is there something I should do when a journalist accuses me of lying? I added my non-profit's response to the false accusation. But is there something else I should do? Greg (talk) 05:12, 15 February 2026 (UTC)
- Hi @Greg Lindahl — thanks for your contributions (it's cool to see someone who joined in 2001 still active!) and for reaching out. Because you work for Common Crawl, you have a paid conflict of interest and should follow the guidance for paid editors at Wikipedia:Conflict of interest. This includes, importantly, not directly editing the article, and instead proposing changes here on the talk page. You can use the {{Edit request}} template to request that an independent editor review them.
- I have procedurally reverted the article to the state it was in before you edited it. This is not a direct response to the content of the edits themselves, but rather a process step to ensure that the changes you are seeking are reviewed by independent editors. For the page move, your request will be more likely to be accepted if you demonstrate that "Common Crawl Foundation" has become the common name. And for the response to the Atlantic, your request will be more likely to be accepted if you show that the response has been covered in reliable sources rather than just posted to your website (a primary source).
- Cheers, Sdkb talk 13:46, 14 April 2026 (UTC)
- Thanks. I have no idea how this works, and I'm happy to have this edit dropped on the floor, if you prefer that. Greg (talk) 23:44, 9 May 2026 (UTC)
- In terms of the common name debate, Common Crawl Foundation is used here: The Register 2025, LA Times 2012, The Atlantic 2025 (mostly referred to as Common Crawl), Washington Post 2025. Superb Owl (talk) 18:38, 10 May 2026 (UTC)