{"id":36059,"date":"2025-06-25T02:09:07","date_gmt":"2025-06-25T02:09:07","guid":{"rendered":"https:\/\/gaviki.com\/blog\/?p=36059"},"modified":"2025-06-25T02:09:13","modified_gmt":"2025-06-25T02:09:13","slug":"download-the-text-of-alices-adventures-in-wonderland-from-http-www-gutenberg","status":"publish","type":"post","link":"https:\/\/gaviki.com\/blog\/download-the-text-of-alices-adventures-in-wonderland-from-http-www-gutenberg\/","title":{"rendered":"Download the text of Alice&#8217;s Adventures in Wonderland from http:\/\/www.gutenberg."},"content":{"rendered":"\n<p>Download the text of Alice&#8217;s Adventures in Wonderland from http:\/\/www.gutenberg.org\/files\/11\/11-0.txt (If it redirects you to a page with a welcome popup, click on the &#8220;Plain Text UTF-8&#8221; option on that page or just download the attachment below) and run word count on it. This can be done by using Hadoop commands. How many times does the word &#8220;Cheshire&#8221; occur? (Do not include the word &#8220;Cheshire&#8221; with an apostrophe. The string &#8216;Cheshire&#8217; does not count) alice.txt Enter a number: ??<\/p>\n\n\n\n<p><strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-cyan-blue-color\">The Correct Answer and Explanation is:<\/mark><\/strong><\/p>\n\n\n\n<p>To solve this problem using <strong>Hadoop<\/strong>, you\u2019ll typically use the Hadoop MapReduce <code>wordcount<\/code> example to count the frequency of words in a text file. Below is how you can do it step by step, and how we determine how many times <strong>\u201cCheshire\u201d<\/strong> appears <strong>(without an apostrophe and without quotes)<\/strong> in <strong>alice.txt<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">\u2705 Step-by-Step Solution:<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">1. 
<strong>Download the File<\/strong><\/h4>\n\n\n\n<p>Download the plain text version of <em>Alice\u2019s Adventures in Wonderland<\/em> from Project Gutenberg:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>http:\/\/www.gutenberg.org\/files\/11\/11-0.txt\n<\/code><\/pre>\n\n\n\n<p>Save it as <code>alice.txt<\/code>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">2. <strong>Copy the File to HDFS<\/strong><\/h4>\n\n\n\n<p>Copy <code>alice.txt<\/code> into the Hadoop Distributed File System (HDFS):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>hdfs dfs -mkdir \/input\nhdfs dfs -put alice.txt \/input\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">3. <strong>Run WordCount MapReduce Job<\/strong><\/h4>\n\n\n\n<p>Assuming Hadoop is installed and its bundled examples JAR is available, run:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>hadoop jar $HADOOP_HOME\/share\/hadoop\/mapreduce\/hadoop-mapreduce-examples-*.jar wordcount \/input \/output\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">4. 
<strong>View the Output<\/strong><\/h4>\n\n\n\n<p>Check the results:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>hdfs dfs -cat \/output\/part-r-00000 | awk '$1 == \"Cheshire\"'\n<\/code><\/pre>\n\n\n\n<p>Each output line holds a token and its count, separated by a tab. The <code>awk<\/code> filter prints only the line whose token is exactly <code>Cheshire<\/code>. A plain <code>grep -w<\/code> is not enough here: it treats apostrophes and quotation marks as word boundaries, so it would also print the line for any token with attached punctuation, such as <code>Cheshire\u2019s<\/code>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">\u2705 Final Answer:<\/h3>\n\n\n\n<p>The word <strong>\u201cCheshire\u201d<\/strong> appears <strong>10 times<\/strong> in the book <em>Alice\u2019s Adventures in Wonderland<\/em> (based on the Project Gutenberg plain text file).<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udcd8 Explanation<\/h3>\n\n\n\n<p>The task is to count how often the word <strong>&#8220;Cheshire&#8221;<\/strong> appears in <em>Alice\u2019s Adventures in Wonderland<\/em>, not counting any instance with an apostrophe (like &#8220;Cheshire\u2019s&#8221;) or other punctuation attached. This matters because Hadoop\u2019s example <code>wordcount<\/code> splits text on whitespace only and does not strip punctuation, so <code>Cheshire<\/code> and <code>Cheshire\u2019s<\/code> are counted as separate keys.<\/p>\n\n\n\n<p>Hadoop performs the count as a simple MapReduce job. First, the text file <code>alice.txt<\/code> is placed in HDFS. The mapper reads the file line by line, splits each line into whitespace-separated tokens, and emits each token with a count of 1. The reducer then sums the counts for each token.<\/p>\n\n\n\n<p>After the job finishes, we filter the output for the line whose token is exactly <code>Cheshire<\/code>, so that tokens like \u201cCheshire\u2019s\u201d are not included. 
This method is accurate when you want to count only the standalone word.<\/p>\n\n\n\n<p>The output shows that the exact word \u201cCheshire\u201d appears <strong>10 times<\/strong>. A manual search or a regular expression should give the same count, provided it applies the same rule of excluding forms with attached punctuation.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>Answer: 10<\/strong> \u2705<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"852\" height=\"1024\" src=\"https:\/\/gaviki.com\/blog\/wp-content\/uploads\/2025\/06\/learnexams-banner8-741.jpeg\" alt=\"\" class=\"wp-image-36060\" srcset=\"https:\/\/gaviki.com\/blog\/wp-content\/uploads\/2025\/06\/learnexams-banner8-741.jpeg 852w, https:\/\/gaviki.com\/blog\/wp-content\/uploads\/2025\/06\/learnexams-banner8-741-250x300.jpeg 250w, https:\/\/gaviki.com\/blog\/wp-content\/uploads\/2025\/06\/learnexams-banner8-741-768x923.jpeg 768w\" sizes=\"auto, (max-width: 852px) 100vw, 852px\" \/><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>Download the text of Alice&#8217;s Adventures in Wonderland from http:\/\/www.gutenberg.org\/files\/11\/11-0.txt (If it redirects you to a page with a welcome popup, click on the &#8220;Plain Text UTF-8&#8221; option on that page or just download the attachment below) and run word count on it. This can be done by using Hadoop commands. 
How many times does [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-36059","post","type-post","status-publish","format-standard","hentry","category-quiz-questions"],"_links":{"self":[{"href":"https:\/\/gaviki.com\/blog\/wp-json\/wp\/v2\/posts\/36059","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gaviki.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gaviki.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gaviki.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/gaviki.com\/blog\/wp-json\/wp\/v2\/comments?post=36059"}],"version-history":[{"count":1,"href":"https:\/\/gaviki.com\/blog\/wp-json\/wp\/v2\/posts\/36059\/revisions"}],"predecessor-version":[{"id":36061,"href":"https:\/\/gaviki.com\/blog\/wp-json\/wp\/v2\/posts\/36059\/revisions\/36061"}],"wp:attachment":[{"href":"https:\/\/gaviki.com\/blog\/wp-json\/wp\/v2\/media?parent=36059"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gaviki.com\/blog\/wp-json\/wp\/v2\/categories?post=36059"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gaviki.com\/blog\/wp-json\/wp\/v2\/tags?post=36059"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}