Saturday, 27 January 2018

Moving Average - Hadoop


Hadoop example: Hello World with Java, Pig, Hive, Flume, Fuse, Oozie, and Sqoop with Informix, DB2, and MySQL. There is a lot of excitement about big data, and a lot of confusion to go with it. This article provides a working definition of big data and then works through a series of examples so you can get a first-hand understanding of some of the capabilities of Hadoop, the leading open source technology in the big data space. Specifically, let's focus on the following questions: What are big data, Hadoop, Sqoop, Hive, and Pig, and why is there so much excitement in this space? How does Hadoop relate to IBM DB2 and Informix? Can these technologies play together? How can I get started with big data? What are some easy examples that run on a single computer?

For the super impatient, if you can already define Hadoop and want to get right to working on the code samples, then do the following. Fire up your Informix or DB2 instance. Download the VMware image from the Cloudera website and increase the virtual machine RAM setting to 1.5 GB. Jump to the section that contains the code samples. There is a MySQL instance built into the VMware image, so if you are doing the exercises without network connectivity, use the MySQL examples. For everyone else, read on.

What is big data? Big data is large in quantity, is captured at a rapid rate, and is structured or unstructured, or some combination of the above. These factors make big data difficult to capture, mine, and manage using traditional methods. There is so much hype in this space that there could be an extended debate just about the definition of big data. Using big data technology is not restricted to large volumes; the examples in this article use small samples to illustrate the capabilities of the technology. As of 2012, the clusters that are big are in the 100-petabyte range. Big data can be both structured and unstructured. Traditional relational databases, like Informix and DB2, provide proven solutions for structured data, and via extensibility they also manage unstructured data. The Hadoop technology brings new and more accessible programming techniques for working on massive data stores containing both structured and unstructured data.

Why all the excitement? There are many factors contributing to the hype around big data, including the following. Bringing compute and storage together on commodity hardware: the result is blazing speed at low cost. Price performance: the Hadoop big data technology provides significant cost savings (think a factor of roughly 10) with significant performance improvements (again, think a factor of 10); your mileage may vary. If the existing technology can be beaten that dramatically, it is worth examining whether Hadoop can complement or replace aspects of your current architecture. Linear scalability: every parallel technology makes claims about scale-up. Hadoop has genuine scalability, since the latest release is expanding the limit on the number of nodes to beyond 4,000. Full access to unstructured data: a highly scalable data store with a good parallel programming model, MapReduce, has been a challenge for the industry for some time. Hadoop's programming model does not solve every problem, but it is a strong solution for many tasks.

Hadoop distributions: IBM and Cloudera. One of the points of confusion is, where do I get software to work on big data? The examples in this article are based on the free Cloudera distribution of Hadoop called CDH (for Cloudera Distribution including Hadoop). This is available as a VMware image from the Cloudera website. IBM has recently announced it is porting its big data platform to run on CDH. The term "disruptive technology" is heavily overused, but in this case it may be appropriate.

What is Hadoop? Following are several definitions of Hadoop, each one targeting a different audience within the enterprise. For the executives: Hadoop is an Apache open source software project to get value out of the incredible volume, velocity, and variety of data about your organization.
Use the data instead of throwing most of it away. For the technical managers: an open source suite of software that mines the structured and unstructured big data about your company. It integrates with your existing Business Intelligence ecosystem. Legal: an open source suite of software that is packaged and supported by multiple suppliers. Engineering: a massively parallel, shared-nothing, Java-based map and reduce execution environment. Think hundreds to thousands of computers working on the same problem, with built-in failure resilience. Projects in the Hadoop ecosystem provide data loading, higher-level languages, automated cloud deployment, and other capabilities. Security: a Kerberos-secured software suite.

What are the components of Hadoop? The Apache Hadoop project has two core components: the file store called the Hadoop Distributed File System (HDFS), and the programming framework called MapReduce. There are a number of supporting projects that build on HDFS and MapReduce. This article provides a summary and encourages you to get the O'Reilly book "Hadoop: The Definitive Guide", 3rd Edition, for more detail. The definitions below are meant to provide just enough background for you to use the code examples that follow. This article is really meant to get you started with hands-on experience with the technology; it is a how-to article more than a what-is or let's-discuss article.

HDFS: if you want 4,000 computers to work on your data, then you had better spread your data across 4,000 computers. HDFS does this for you. HDFS has a few moving parts: the DataNodes store your data, and the NameNode keeps track of where things are stored. There are other pieces, but you have enough to get started.

MapReduce: this is the programming model for Hadoop. There are two phases, not surprisingly called Map and Reduce. To impress your friends, tell them there is a shuffle-sort between the Map phase and the Reduce phase. The JobTracker manages the 4,000-odd components of your MapReduce job, and the TaskTrackers take orders from the JobTracker. If you like Java, code in Java. If you like SQL or other non-Java languages, you are still in luck: you can use a utility called Hadoop Streaming.

Hadoop Streaming: a utility that enables MapReduce code in any language: C, Perl, Python, C++, Bash, and so on. The examples later in this article include a Python mapper and an AWK reducer.

Hive and Hue: if you like SQL, you will be delighted to hear that you can write SQL and have Hive convert it into a MapReduce job. No, you don't get a full ANSI-SQL environment, but you do get 4,000 nodes and multi-petabyte scalability. Hue gives you a browser-based graphical interface for doing your Hive work.

Pig: a higher-level programming environment for doing MapReduce coding. The Pig language is called Pig Latin. You may find the naming conventions somewhat unconventional, but you get incredible price-performance and high availability.

Sqoop: provides bi-directional data transfer between Hadoop and your favorite relational database.

Oozie: manages Hadoop workflow. It does not replace your scheduler or BPM tooling, but it does provide if-then-else branching and control within your Hadoop jobs.

HBase: a super-scalable key-value store. It works very much like a persistent hash map (Python fans, think dictionary). It is not a relational database, despite the name HBase.

FlumeNG: a real-time loader for streaming your data into Hadoop. It stores data in HDFS and HBase. You'll want to get started with FlumeNG, which improves on the original Flume.

Whirr: cloud provisioning for Hadoop. You can start up a cluster in just a few minutes with a very short configuration file.

Mahout: machine learning for Hadoop, used for predictive analytics and other advanced analysis.

Fuse: makes the HDFS file system look like a regular file system, so you can use ls, rm, cd, and friends on HDFS data.

Zookeeper: used to manage synchronization for the cluster. You won't be working much with Zookeeper, but it is working hard for you. If you think you need to write a program that uses Zookeeper, you are either very, very smart and could be a committer on an Apache project, or you are about to have a very bad day.
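To make the Hadoop Streaming idea concrete before looking at the architecture, here is a minimal sketch of a streaming job whose mapper and reducer are both small awk scripts (the article's own streaming example, later on, uses a Python mapper and an AWK reducer). This is a generic illustration rather than one of the article's listings; the input path, output path, and the location of the streaming jar are assumptions that vary by distribution.

# Write a trivial mapper: emit the length of every word it reads.
cat > mapper.sh <<'EOF'
#!/bin/sh
awk '{ for (i = 1; i <= NF; i++) print length($i) }'
EOF

# Write a trivial reducer: average the numbers the mappers emitted.
cat > reducer.sh <<'EOF'
#!/bin/sh
awk '{ sum += $1; n++ } END { if (n) printf "average word length: %.2f\n", sum / n }'
EOF
chmod +x mapper.sh reducer.sh

# Submit the streaming job; the jar path below is a guess for a CDH-style layout.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /user/cloudera/books \
  -output /user/cloudera/avg-word-length \
  -mapper mapper.sh -reducer reducer.sh \
  -file mapper.sh -file reducer.sh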
Figure 1 shows the major pieces of Hadoop. Figure 1. Hadoop architecture.

HDFS, the bottom layer, sits on a cluster of commodity hardware: simple rack-mounted servers, each with two hex-core CPUs, 6 to 12 disks, and 32 gig of RAM. For a map-reduce job, the mapper layer reads from the disks at very high speed. The mapper emits key-value pairs that are sorted and presented to the reducer, and the reducer layer summarizes the key-value pairs. No, you don't have to summarize; you can actually have a map-reduce job that has only mappers. This should become easier to understand when you get to the Python and AWK example.

How does Hadoop integrate with my Informix or DB2 infrastructure? Hadoop integrates very well with Informix and DB2 databases via Sqoop. Sqoop is the leading open source implementation for moving data between Hadoop and relational databases. It uses JDBC to read and write Informix, DB2, MySQL, Oracle, and other sources. There are optimized adapters for several databases, including Netezza and DB2.

Getting started: how to run simple Hadoop, Hive, Pig, Oozie, and Sqoop examples. You are done with introductions and definitions; now it is time for the good stuff. To continue, you'll need to download the VMware, VirtualBox, or other image from the Cloudera website and start doing MapReduce. The virtual image assumes you have a 64-bit computer and one of the popular virtualization environments. Most of the virtualization environments have a free download. When you try to boot up a 64-bit virtual image, you may get complaints about BIOS settings. Figure 2 shows the required change in the BIOS, in this case on a Thinkpad. Use caution when making changes: some corporate security packages require a passcode after a BIOS change before the system will reboot. Figure 2. BIOS settings for a 64-bit virtual guest.

The big data used here is actually rather small. The point is not to make your laptop catch fire from grinding on a massive file, but to show you sources of data that are interesting, and map-reduce jobs that answer meaningful questions.

Download the Hadoop virtual image. It is highly recommended that you use the Cloudera image for running these examples. Hadoop is a technology that solves problems; the Cloudera image packaging lets you focus on the big data questions. If you decide to assemble all the parts yourself, Hadoop becomes the problem, not the solution.

Download an image. The CDH4 image, the latest offering, is available here: CDH4 image. The prior version, CDH3, is available here: CDH3 image. You have your choice of virtualization technologies; you can download a free virtualization environment from VMware and others. For example, go to vmware.com and download vmware-player. Your laptop is probably running Windows, so you would download vmware-player for Windows. The examples in this article use VMware and run Ubuntu Linux, using tar rather than winzip or an equivalent. Once downloaded, untar and unzip as follows: tar -zxvf cloudera-demo-vm-cdh4.0.0-vmware.tar.gz. Or, if you are using CDH3, use the following: tar -zxvf cloudera-demo-vm-cdh3u4-vmware.tar.gz. Unzip typically works on tar files. Once unzipped, you can fire up the image as follows: vmplayer cloudera-demo-vm.vmx. You'll now have a screen that looks like what is shown in Figure 3. Figure 3. Cloudera virtual image.

The vmplayer command dives right in and starts the virtual machine. If you are using CDH3, you will need to shut down the machine and change the memory settings. Use the power-button icon next to the clock at the bottom of the screen to power off the virtual machine. You then have edit access to the virtual machine settings. For CDH3, the next step is to super-charge the virtual image with more RAM. Most settings can be changed only with the virtual machine powered off. Figure 4 shows how to access the settings and increase the allocated RAM to over 2 GB. Figure 4. Adding RAM to the virtual machine.
As shown in Figure 5, you can change the network setting to bridged. With this setting the virtual machine gets its own IP address. If this creates problems on your network, you can optionally use Network Address Translation (NAT). You'll be using the network to connect to the database. Figure 5. Changing the network settings to bridged.

You are limited by the RAM on the host system, so don't try to allocate more RAM than exists on your machine. If you do, the computer will run very slowly.

Now, for the moment you have been waiting for: go ahead and power on the virtual machine. The user cloudera is automatically logged in at startup. If you need it, the cloudera password is: cloudera.

Install Informix and DB2. You'll need a database to work with. If you don't already have a database, you can download the Informix Developer Edition here, or the free DB2 Express-C Edition. Another alternative for installing DB2 is to download the VMware image that already has DB2 installed on a SuSE Linux operating system. Log in as root, with the password: password. Switch to the db2inst1 user; working as root is like driving a car without a seatbelt. Please talk to your friendly local DBA about getting the database up and running, since this article won't cover that here. Don't try to install the database inside the Cloudera virtual image, because there isn't enough free disk space.

The virtual machine will connect to the database using Sqoop, which requires a JDBC driver, so you will need the JDBC driver for your database in the virtual image. You can install the Informix driver here. The Informix JDBC driver install (remember, just the driver inside the virtual image, not the database) is shown in Listing 1. Listing 1. Informix JDBC driver install. Note: select a subdirectory relative to /home/cloudera so as not to require root permission for the installation.

The DB2 JDBC driver is in a zipped format, so just unzip it in the destination directory, as shown in Listing 2. Listing 2. DB2 JDBC driver install.

A quick introduction to HDFS and MapReduce. Before you start moving data between your relational database and Hadoop, you need a quick introduction to HDFS and MapReduce. There are a lot of "hello world" style tutorials for Hadoop, so the examples here are meant to give only enough background for the database exercises to make sense to you.

HDFS provides storage across the nodes in your cluster. The first step in using Hadoop is putting data into HDFS. The code shown in Listing 3 gets a copy of a book by Mark Twain and a book by James Fenimore Cooper and copies these texts into HDFS. Listing 3. Load Mark Twain and James Fenimore Cooper into HDFS.

You now have two files in a directory in HDFS. Please contain your excitement. Seriously, on a single node with only about one megabyte, this is as exciting as watching paint dry. But if this were a 400-node cluster and you had five petabytes live, you really would have trouble containing your excitement.

Many of the Hadoop tutorials use the word count example that is included in the example jar file. It turns out that a lot of analysis involves counting and aggregating. The example in Listing 4 shows you how to invoke the word counter. Listing 4. Counting words in Twain and Cooper.

The .gz suffix on DS.txt.gz tells Hadoop to deal with decompression as part of the map-reduce processing. Cooper is a bit verbose, so well deserves the compaction.

There is quite a stream of messages from running your word count job; Hadoop is happy to provide plenty of detail about the mapping and reducing programs running on your behalf. The critical lines to look for are shown in Listing 5, along with a second listing from a failed job and how to fix one of the most common errors you'll encounter running MapReduce. Listing 5. MapReduce messages, the happy path.

What did all those messages mean? Hadoop has done a lot of work and is trying to tell you about it, including the following. It checked whether the input file exists.
It checked whether the output directory exists and, if so, aborted the job; nothing is worse than overwriting hours of computation because of a simple keyboard mistake. It distributed the Java jar file to all the nodes responsible for doing the work; in this case, that is only one node. It ran the mapper phase of the job, which typically parses the input file and emits key-value pairs. Note that the key and the value can be objects. It ran the sort phase, which sorts the mapper output by key. It ran the reduce phase, which typically summarizes the key-value stream and writes output to HDFS. And it created many metrics along the way.

Figure 6 shows a sample web page of Hadoop job metrics after running the Hive exercise. Figure 6. Sample Hadoop web page.

What did the job do, and where is the output? Both are good questions, and the answers are shown in Listing 6. Listing 6. Map-reduce output.

In the event you run the same job twice and forget to delete the output directory, you will receive the error messages shown in Listing 7. Fixing this error is as simple as deleting the directory. Listing 7. MapReduce messages - failure due to output already existing in HDFS.

Hadoop includes a browser interface for inspecting the status of HDFS. Figure 7 shows the output of the word count job. Figure 7. Exploring HDFS with a browser.

A more sophisticated console is available for free from the Cloudera website. It provides a number of capabilities beyond the standard Hadoop web interfaces. Notice that the health status of HDFS in Figure 8 is shown as Bad. Figure 8. Hadoop services managed by Cloudera Manager. Why is it bad? Because in a single virtual machine, HDFS cannot make three copies of the data blocks. When blocks are under-replicated there is a risk of data loss, so the health of the system is bad. Good thing you aren't trying to run production Hadoop jobs on a single node.

You are not limited to Java for your MapReduce jobs. This last example of MapReduce uses Hadoop Streaming to support a mapper written in Python and a reducer using AWK. No, you don't have to be a Java guru to write map-reduce!

Mark Twain was not a big fan of Cooper. In this use case, Hadoop will provide some simple literary criticism comparing Twain and Cooper. The Flesch-Kincaid test calculates the reading level of a particular text. One of the factors in this analysis is average sentence length. Parsing sentences turns out to be more complicated than just looking for the period character; the openNLP package and the Python NLTK package have excellent sentence parsers. For simplicity, the example shown in Listing 8 uses word length as a surrogate for the number of syllables in a word. If you want to take this to the next level, implement the Flesch-Kincaid test in MapReduce, crawl the web, and calculate reading levels for your favorite news sites. Listing 8. Python-based mapper for literary criticism.

The mapper output, for the word "Twain", would be: 5 0. The numerical word lengths are sorted and presented to the reducer in sorted order. In the examples shown in Listings 9 and 10, sorting the data isn't required to get the correct output, but the sort is built into the MapReduce infrastructure and happens anyway. Listing 9. An AWK reducer for literary criticism. Listing 10. Running the Python mapper and AWK reducer with Hadoop Streaming.

The Mark Twain fans can happily relax, knowing that Hadoop finds Cooper to use longer words, and with a shocking standard deviation (humor intended). That does, of course, assume that shorter words are better. Let's move on: next up is writing data from HDFS into Informix and DB2.

Using Sqoop to write data from HDFS to Informix, DB2, or MySQL via JDBC. The Apache Sqoop project is a Hadoop-based, open source, Hadoop-to-database data movement utility. Sqoop was originally created in a hackathon at Cloudera and then open sourced. Moving data from HDFS to a relational database is a common use case. HDFS and map-reduce are great at doing the heavy lifting, and for simple queries or a back-end store for a website, caching the map-reduce output in a relational store is a good design pattern. You can avoid re-running the map-reduce word count by just Sqooping the results into Informix and DB2.
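The per-database listings below give the article's actual commands; purely as a hedged sketch of the shape of such a call, a Sqoop export of the word-count output might look like the following, where the JDBC URL, credentials, table name, and HDFS path are placeholders rather than values taken from the article.

# Push tab-delimited word-count output from HDFS into a relational table.
# Every value below (URL, user, password, table, path) is a placeholder.
sqoop export \
  --connect 'jdbc:mysql://localhost/wordcounts' \
  --username someuser --password somepass \
  --table wordcount \
  --export-dir /user/cloudera/wordcount-output \
  --input-fields-terminated-by '\t' \
  -m 1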
You have generated data about Twain and Cooper; now let's move it into a database, as shown in Listing 11. Listing 11. JDBC driver setup.

The examples shown in Listings 12 through 15 are presented for each database. Please skip to the example that interests you, whether Informix, DB2, or MySQL. For the database polyglots, have fun doing every one. If your database of choice is not included here, it won't be a grand challenge to make these samples work elsewhere. Listing 12. Informix users: Sqoop writing the results of the word count to Informix. Listing 13. Informix users: Sqoop writing the results of the word count to Informix. Listing 14. DB2 users: Sqoop writing the results of the word count to DB2. Listing 15. MySQL users: Sqoop writing the results of the word count to MySQL.

Importing data into HDFS from Informix and DB2 with Sqoop. Inserting data into Hadoop HDFS can also be accomplished with Sqoop; the bi-directional functionality is controlled via the import parameter. The sample databases that come with both products have some simple datasets that you can use for this purpose. Listing 16 shows the syntax and results for Sqooping each server. MySQL users, please adapt the syntax from either the Informix or DB2 examples that follow. Listing 16. Sqoop import from the Informix sample database to HDFS.

Why are there four different files, each containing only part of the data? Sqoop is a highly parallelized utility. If a 4,000-node cluster running Sqoop did a full-throttle import from a database, the 4,000 connections would look very much like a denial-of-service attack against the database. Sqoop's default connection limit is four JDBC connections. Each connection generates a data file in HDFS, hence the four files. Not to worry: you'll see how Hadoop works across these files without any difficulty.

The next step is to import a DB2 table. As shown in Listing 17, by specifying the -m 1 option, a table without a primary key can be imported, and the result is a single file. Listing 17. Sqoop import from the DB2 sample database to HDFS.

Using Hive: joining Informix and DB2 data. There is an interesting use case in joining data from Informix to DB2. Not very exciting for two trivial tables, but a huge win for multiple terabytes or petabytes of data. There are two fundamental approaches for joining different data sources: leaving the data at rest and using federation technology, versus moving the data to a single store to perform the join. The economics and performance of Hadoop make moving the data into HDFS and doing the heavy lifting with MapReduce an easy choice. Network bandwidth limits create a fundamental barrier if you try to join data at rest with a federation-style technology.

Hive provides a subset of SQL for operating on the cluster. It does not provide transaction semantics, and it is not a replacement for Informix or DB2. But if you have some heavy lifting in the form of table joins, even if you have some smaller tables but need to do nasty Cartesian products, Hadoop is the tool of choice.

To use the Hive query language, a subset of SQL called HiveQL, table metadata is required. You define the metadata against existing files in HDFS. Sqoop provides a convenient shortcut with its create-hive-table option. MySQL users should feel free to adapt the examples shown in Listing 18; an interesting exercise would be joining MySQL, or any other relational database tables, to big data tables. Listing 18. Joining the informix.customer table to the db2.staff table.

It is much nicer to use Hue for a graphical browser interface, as shown in Figures 9, 10, and 11. Figure 9. Hue Beeswax GUI for Hive in CDH4, viewing a HiveQL query. Figure 10. Hue Beeswax GUI for Hive, viewing a HiveQL query. Figure 11. Hue Beeswax graphical browser, viewing the Informix-DB2 join result.

Using Pig: joining Informix and DB2 data. Pig is a procedural language.
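As a taste of the language, a join of the two Sqooped datasets might be written roughly as follows. This is a hedged sketch with guessed file paths, schemas, and join columns; the article's actual script appears in Listing 19 below.

# Join the Sqooped customer and staff files in Pig Latin; the paths, schemas,
# and join columns here are guesses for illustration only.
pig <<'EOF'
customers = LOAD '/user/cloudera/customer' USING PigStorage(',')
            AS (customer_num:int, fname:chararray, lname:chararray);
staff     = LOAD '/user/cloudera/staff' USING PigStorage(',')
            AS (id:int, name:chararray, dept:chararray);
joined    = JOIN customers BY customer_num, staff BY id;
DUMP joined;
EOF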
Just like Hive, under the covers it generates MapReduce code. Hadoop's ease of use will keep improving as more projects become available, and as much as some of us really like the command line, there are several graphical user interfaces that work very well with Hadoop. Listing 19 shows the Pig code that is used to join the customer table and the staff table from the prior example. Listing 19. Pig example for joining the Informix table to the DB2 table.

How do I choose Java, Hive, or Pig? You have multiple options for programming Hadoop, and it is best to look at the use case to pick the right tool for the job. You are not limited to working on relational data, but this article focuses on Informix, DB2, and Hadoop playing well together. Writing hundreds of lines of Java to implement a relational-style hash join is a complete waste of time, since that Hadoop MapReduce algorithm is already available. How do you choose? It is largely a matter of personal preference. Some like coding set operations in SQL; some prefer procedural code. You should pick the language that will make you most productive. If you have multiple relational systems and want to combine all the data with great performance at a low price point, Hadoop, MapReduce, Hive, and Pig are ready to help.

Don't delete your data: rolling a partition from Informix into HDFS. Most modern relational databases can partition data. A common use case is to partition by time period: a fixed window of data is stored, for example a rolling 18-month interval, after which the data is archived. The detach-partition capability is very powerful. But after the partition is detached, what does one do with the data? An archive of the old data is a very expensive way to discard the old bytes, and once it has been moved to a less accessible medium, the data is rarely accessed again unless there is a legal audit requirement. Hadoop provides a far better alternative. Moving the archival bytes from the old partition into Hadoop provides high-performance access at much lower cost than keeping the data in the original transactional system or datamart/data warehouse. The data is too old to be of transactional value, but it is still very valuable to the organization for long-term analysis. The Sqoop examples shown earlier provide the basics for moving this data from a relational partition to HDFS.

Fuse: accessing your HDFS files via NFS. The Informix, DB2, or flat-file data sitting in HDFS can be accessed via NFS, as shown in Listing 20. This provides command-line operations without using the "hadoop fs -yadayada" interface. From a technology use-case perspective, NFS is severely limited in a big data environment, but the examples are included for developers and not-so-big data. Listing 20. Setting up Fuse to access HDFS data via NFS.

Flume: create a load-ready file. Flume next generation, or flume-ng, is a high-speed parallel loader. Databases have high-speed loaders, so how do these play well together? The relational use case for flume-ng is creating a load-ready file, locally or remotely, so a relational server can use its high-speed loader. Yes, this functionality overlaps with Sqoop, but the script shown in Listing 21 was created at the request of a client specifically for this style of database load. Listing 21. Exporting HDFS data to a flat file for loading by a database.

Oozie: adding workflow for multiple jobs. Oozie chains together multiple Hadoop jobs. There is a nice set of examples included with Oozie that is used in the code set shown in Listing 22. Listing 22. Job control with Oozie.

HBase, a high-performance key-value store. HBase is a high-performance key-value store. If your use case requires scalability and only requires the database equivalent of auto-commit transactions, HBase may well be the technology to use. HBase is not a database; the name is unfortunate, since to some the term "base" implies database. It does do an excellent job of high-performance key-value storage.
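As a tiny taste of that key-value model (a generic sketch, not one of the article's listings; the table, row key, and column names are invented), the HBase shell can create a table, store a value under a key, and read it back:

# Generic HBase shell session; 'mytable', 'cf', and 'row1' are invented names.
hbase shell <<'EOF'
create 'mytable', 'cf'
put 'mytable', 'row1', 'cf:word', 'hello'
get 'mytable', 'row1'
scan 'mytable'
EOF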
There is some overlap in functionality between HBase, Informix, DB2, and other relational databases. For ACID transactions, full SQL compliance, and multiple indexes, a traditional relational database is the obvious choice. This last code exercise is meant to give basic familiarity with HBase. It is simple by design and in no way represents the scope of HBase's functionality. Please use this example to understand some of the basic capabilities of HBase. "HBase: The Definitive Guide", by Lars George, is mandatory reading if you plan to implement or reject HBase for your particular use case. This last example, shown in Listings 23 and 24, uses the REST interface provided with HBase to insert key-values into an HBase table. The test harness is curl based. Listing 23. Create an HBase table and insert a row. Listing 24. Using the HBase REST interface.

Conclusion. Wow, you made it to the end, well done! This is just the beginning of understanding Hadoop and how it interacts with Informix and DB2. Here are some suggestions for your next steps. Take the examples shown earlier and adapt them to your servers; you'll want to use small data, since there isn't much space in the virtual image. Get certified as a Hadoop administrator: visit the Cloudera site for course and testing information. Get certified as a Hadoop developer. Start up a cluster using the free edition of Cloudera Manager. Get started with IBM BigSheets running on top of CDH4.

Downloadable resources. Related topics.

Coordinating HBase and Hive. This content is part of the series: SQL to Hadoop and back again, Part 2. Stay tuned for additional content in this series.

Hive and HBase: integrating Hadoop and SQL with InfoSphere BigInsights. InfoSphere® BigInsights™ makes integration between Hadoop and SQL databases much simpler, because it provides the tools and mechanics needed to export and import data between databases. Using InfoSphere BigInsights, you can define database sources, views, queries, and other selection criteria, then convert that automatically into a variety of formats before importing that collection directly into Hadoop (see Related topics for more information). For example, you can create a query that extracts data and populates a JSON array with record data. Once exported, a job can be created to process and manipulate the data before either displaying it or importing the processed data and exporting the data back into DB2.

Hive is a data warehouse solution that has a thin, SQL-like querying language called HiveQL. This language is used to query data, and it saves you from writing native MapReduce processing to get your data out. Since you already know SQL, Hive is a good solution because it enables you to take advantage of your SQL knowledge to get data in and out of Apache Hadoop. One limitation of the Hive approach, though, is that it makes use of the append-only nature of HDFS to provide storage. This means that it is phenomenally easy to get the data in, but you cannot update it. Hive is not a database but a data warehouse with convenient SQL querying built on top of it. Despite the convenient interface, particularly on very large datasets, the fact that the query time required to process requests is so large means that jobs are submitted and results accessed when available. This means that the information is not interactively available. HBase, by comparison, is a key-value (NoSQL) data store that enables you to write, update, and read data randomly, just like any other database. But it's not SQL. HBase enables you to make use of Hadoop in a more traditional real-time fashion than would normally be possible with the Hadoop architecture. Processing and querying data is more complex with HBase, but you can combine the HBase structure with Hive to get an SQL-like interface.
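The mechanics of that combination are walked through later in this article; as a preview, the mapping is declared in Hive roughly as below. This is a hedged sketch using the standard Hive HBase storage handler; the column names echo the bus-data example introduced later, and the exact definition used by the article is in its later listing.

# Hedged sketch of a Hive table backed by an HBase table via the standard
# Hive-HBase storage handler; the table and column names are illustrative.
hive <<'EOF'
CREATE EXTERNAL TABLE hbase_chicagobus (
  rowkey STRING, logtime STRING, region INT,
  buscount INT, readnumber INT, speed FLOAT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:logtime,cf:region,cf:buscount,cf:readnumber,cf:speed")
TBLPROPERTIES ("hbase.table.name" = "chicago");
EOF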
HBase can be really practical as part of a solution that adds the data, processes it, summarizes it through MapReduce, and stores the output for use in future processing. In short, think of Hive as an append-only SQL database and HBase as a more typical read-write NoSQL data store. Hive is useful for SQL integration if you want to store long-term data to be processed and summarized and loaded back. Hive's major limitation is query speed. When dealing with billions of rows, there is no live querying of the data that would be fast enough for any interactive interface to the data. For example, with data logging, the quantities of data can be huge, but what you often need is quick, flexible querying on either summarized or extreme data (i.e. faults and failures). HBase is useful when what you want is to store large volumes of flexible data and query that information, but you might want only smaller datasets to work with. Hence, you might export data that simultaneously needs to be kept whole (such as sales or financial data), may change over time, and also needs to be queried. HBase can then be combined with traditional SQL or Hive to allow snapshots, ranges, or aggregate data to be queried.

Making use of Hive The primary reason to use Hive over a typical SQL database infrastructure is simply the size of the data and the length of time required to perform the query. Rather than dumping information into Hadoop, writing your own MapReduce query, and getting the information back out, with Hive you can (normally) write the same SQL statement, but on a much larger dataset. Hive accomplishes this task by translating the SQL statement into a more typical Hadoop MapReduce job that assembles the data into a tabular format. This is where the limitation comes in: Hive is not a real-time or live querying solution. Once you submit the job, it can take a long time to get a response. A typical use of Hive is to combine the real-time accessibility of data in a local SQL table, export that information into Hive long-term, and reimport the processed version that summarizes the data so it can be used in a live query environment, as seen in Figure 1. Figure 1. Typical use of Hive. We can use the same SQL statement to obtain the data in both situations, a convenience that helps to harmonize and streamline your applications.

Getting data into Hive from SQL The simplest way to get data in and out is by writing a custom application that will extract the required data from your existing SQL table and insert that data into Hive; that is, perform a select query and use INSERT to place those values directly into Hive. Alternatively, depending on your application type, you might consider inserting data directly into both your standard SQL and Hive stores. This way, you can check your standard SQL for recent queries and process Hive data on an hourly/daily/weekly schedule as required to produce the statistical data you require longer-term. Remember, with Hive, in the majority of cases we can use the tabular data directly, export without conversion, and export without reformatting or restructuring. More typically, you will be dumping entire tables or entire datasets by hour or day into Hive through an intermediary file. One benefit of this approach is that we can easily introduce the file by running a local import or by copying that data directly into HDFS.
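In outline, that hand-off is just an export to a delimited file followed by a copy into HDFS. The sketch below assumes a MySQL source; the database, table, and paths are placeholders, not the article's.

# Dump a table to CSV on the MySQL server side (requires the FILE privilege),
# then push the file into HDFS so Hive can load it. All names are placeholders.
mysql -u someuser -p somedb -e \
  "SELECT * FROM datalog INTO OUTFILE '/tmp/datalog.csv'
   FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'"
hdfs dfs -copyFromLocal /tmp/datalog.csv /user/cloudera/datalog.csv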
Let's look at this with an example, using the City of Chicago Traffic Tracker, which studies bus data, giving a historical view of the speed of buses in different regions of Chicago at different times. A sample of the data is shown in Listing 1. Listing 1. Sample of bus data from City of Chicago Traffic Tracker. The sample dataset is just over 2 million records, and it has been loaded from a CSV export. To get that information into Hive, we can export it to a CSV file. Listing 2. Exporting data to a CSV file. Then copy the data into HDFS: hdfs dfs -copyFromLocal chicago.csv. Now open Hive and create a suitable table. Listing 3. Opening Hive and creating a suitable table.

The first thing to notice with Hive is that, compared to most standard SQL environments, we have a somewhat limited set of data types to work with. Although core types like integers, strings, and floats are there, date types are limited. That said, Hive supports reading complex types, such as hashmaps. The core types supported by Hive are: Integers (1-8 bytes); Boolean; Float/Double; String, any sequence of characters, and therefore good for CHAR, VARCHAR, SET, ENUM, TEXT, and BLOB types if the BLOB is storing text; Timestamp, either an EPOCH or a YYYY-MM-DD hh:mm:ss.fffffffff formatted string; and Binary, for BLOBs that are not TEXT.

The second observation is that we are defining the table structure. There are binary structures within Hive, but using CSV natively is convenient for our purposes since we've exported the data to a CSV file. When loading the data, the CSV format specification here will be used to identify the fields in the data. The above example creates a standard table (the table lives within Hive's data store within HDFS). You can also create an external table that uses the copied file directly. However, in an SQL-to-Hive environment, we want to make use of one big table into which we can append new data. Now the data can be loaded into the table: hive> load data inpath 'chicago.csv' into table chicagobus;. This code adds the contents of the CSV file to the existing table. In this case, the table is empty, but you can see how easy it would be to import additional data.

Processing and querying Hive data Once the data is loaded, you can execute Hive queries from the Hive shell just as you would in any other SQL environment. For example, Listing 4 shows the same query to get the first 10 rows. Listing 4. Query to get the first 10 rows. The benefit comes when we perform an aggregate query. For example, let's obtain an average bus speed for each region by day. Listing 5. Obtaining average bus speed for each region by day. As you can see from the output, the query is fundamentally the same as with MySQL (we are grouping by an alternative value), but Hive converts this into a MapReduce job, then calculates the summary values. The reality with data of this style is that the likelihood of requiring the speed in region 1, for example, at 9:50 last Thursday is quite low. But knowing the average speed per day for each region might help predict the timing of traffic or buses in the future. The summary data can be queried and analyzed efficiently with a few thousand rows in an SQL store to allow the data to be sliced and diced accordingly. To output that information back to a file, a number of options are available, but you can simply export back to a local file using the statement in Listing 6. Listing 6. Exporting back to a local file. This code creates a directory (chicagoout), into which the output is written as a series of text files.
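The listings themselves are not reproduced in this excerpt, so the following is a rough reconstruction of the flow they describe: create the table, load the CSV, run the aggregate query, and write a summary back out. The column names come from the import specification quoted later in the article; the types, the date handling, and the exact statements are guesses, not the article's DDL.

hive <<'EOF'
-- Guessed schema for the Chicago bus data (the real DDL is in Listing 3).
CREATE TABLE chicagobus (
  logtime STRING, region INT, buscount INT, readnumber INT, speed FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

-- Load the CSV that was copied into HDFS.
LOAD DATA INPATH 'chicago.csv' INTO TABLE chicagobus;

-- Average bus speed per region per day, assuming logtime starts yyyy-MM-dd.
SELECT region, substr(logtime, 1, 10) AS day, AVG(speed) AS avgspeed
FROM chicagobus GROUP BY region, substr(logtime, 1, 10);

-- Write the summary out to a local directory, as Listing 6 does.
INSERT OVERWRITE LOCAL DIRECTORY 'chicagoout'
SELECT region, substr(logtime, 1, 10), AVG(speed)
FROM chicagobus GROUP BY region, substr(logtime, 1, 10);
EOF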
These can be loaded back into MySQL, but by default the fields are separated by Ctrl-A. The output can be simplified to a CSV file again by creating a table beforehand which uses CSV formatting. Listing 7. Simplifying to a CSV file. Now rerun the job and insert the information into the table. Listing 8. Rerunning the job. You can now find the files that make up the table in your Hive data warehouse directory, so you can copy them out for loading into your standard SQL store, for example, using LOAD DATA INFILE. Listing 9. Loading files into your standard SQL store. This process sounds clunky, but it can be automated, and because we have files from each stage, it is easy to re-execute or reformat the information if required.

Using views If you are using Hive in the manner suggested earlier, and regularly processing and summarizing data daily, weekly or monthly, it might be simpler to create a view. Views within Hive are logical; that is, the output of the view gets re-created each time a query is executed. Although using views is more expensive, for a data exchange environment views hugely simplify the process by simplifying the query structure and allowing consistent output as the underlying source tables expand with new data. For example, to create a view from our original speed/region summary, use Listing 10. Listing 10. Creating a view from our original speed/region summary. Now we can perform Hive queries on the view, including new selections. Listing 11. Performing Hive queries on the view. Not looking good for traffic in region 28, is it?

Data life-cycle management If you decide to use Hive as a live component of your querying mechanism, exporting data from SQL, into Hive, and back out again so it can be used regularly, give careful thought to how you manage the files to ensure accuracy. I tend to use the following basic sequence: Insert new data into a table, datalog, for example. When datalog is full (i.e. an hour, day, week, or month of information is complete), the table is renamed (to datalogarchive, for example), and a new table (same structure) is created. Data is exported from datalogarchive and appended into the Hive table for the data. Depending on how the data is used and processed, analysis occurs by accessing the live data or by running the exact same SQL query statement on Hive. If the data is needed quickly, a view or query is executed that imports the corresponding data back into an SQL table in a summarized format.

For example, for systems logging data (RAM, disk, and other usages) of large clusters, the data is stored in SQL for a day. This approach allows for live monitoring and makes it possible to spot urgent trends. Data is written out each day to Hive, where the log content is analyzed by a series of views that collect extreme values (for example, disk space less than 5 percent), as well as average disk usage. While reviewing the recent data, it's easy to examine and correlate problems (extreme disk usage and increased CPU time, for example), and execute the same query on the Hive long-term store to get the detailed picture.

Using HBase Whereas Hive is useful for huge datasets where live queries are not required, HBase allows us to perform live queries on data, but it works differently. The primary difference is that HBase is not a tabular data store, so importing tabular data from an SQL store is more complex. That said, HBase's internal structure is also more flexible.
Data sources with multiple different data structures can be merged together within HBase. For example, with log data, you can store multiple kinds of sensor data in a single table within HBase, a situation that would require multiple tables in an SQL store.

Getting data in HBase from SQL Unlike Hive, which supports a native tabular layout for the source data, HBase stores key-value pairs. This key-value system complicates the process of exporting data and using it directly, because it first needs to be identified and then formatted accordingly to be understood within HBase. Each item (or item identifier) requires a unique key. The unique ID is important because it is the only way to get individual data back again: the unique ID locates the record within the HBase table. Remember that HBase is about key-value pairs, and the unique ID (or key) is the identifier to the stored record data. For some data types, such as the log data in our Hive examples, the unique key is meaningless because we are unlikely to want to view just one record. For other types, the data may already have a suitable unique ID within the table you want to use. This dilemma can be solved by pre-processing our output, for example, and inserting a UUID() into our output. Listing 12. Pre-processing the output. This code creates a new UUID for each row of output. The UUID can be used to identify each record, even though for this type of data that identification is not individually useful. A secondary consideration within the export process is that HBase does not support joins. If you want to use HBase to write complex queries on your SQL data, you need to run a query within your SQL store that outputs an already-joined or aggregate record.

Within HBase, tables are organized according to column families, and these can be used to bond multiple groups of individual columns, or you can use the column families as actual columns. The translation is from the table to the document structure, as shown in Figure 2. Figure 2. Translation from table to the document structure. To import the data, you have to first create the table. HBase includes a basic shell for accepting commands. We can open it and create a table with a column family called cf. Listing 13. Creating a table with a column family called cf. Copy the tab-separated file created earlier into HDFS: hdfs dfs -copyFromLocal chicago.tsv. Now we can run importtsv, a tool inside the HBase JAR that imports values from a tab-delimited file. Listing 14. Running importtsv.

The command needs to be broken down to make it understandable. hadoop jar /usr/lib/hbase/hbase.jar importtsv runs the importtsv tool, which is included as part of the HBase JAR. -Dimporttsv.columns=HBASE_ROW_KEY,logtime,region,buscount,readnumber,speed defines the columns that will be imported and how they will be identified. The fields are defined as a list; at least one of them must be the identifier (UUID) for each row, specified by HBASE_ROW_KEY, and the others define the field names (within the column family, cf) used for each input column. chicago is the table name; it must have been created before this tool is executed. chicago.tsv is the name of the file in HDFS to be imported. The output from this command (see Listing 15) is rather immense, but then the import process is complicated: the data cannot be directly loaded, and instead gets parsed by a MapReduce process that extracts and then inserts the data into an HBase table. Listing 15. Output from the importtsv command.
If you get a bad-lines output that shows a high number of errors, particularly if the number equals the number of rows you are importing, the problem is probably the format of the source file, or the fact that the number of columns in the source file does not match the number of columns defined in the import specification. Once the data has been imported, we can use the shell to get one record to check that the import has worked. Listing 16. Checking to see if the import has worked. You can see the basic structure of the data as it exists within the HBase table. The unique ID identifies each record, then individual key-value pairs contain the detail (i.e. the columns from the original SQL table).

Alternative SQL or Hive to HBase An alternative model to the raw-data export (less common with HBase because of the record structure) is to use HBase to store summary values and parsed/composed queries. Because data from HBase is stored in a readily and quickly accessible format (access the key and get the data), it can be used to hold chunks of data that have been computed by other jobs, stored into HBase, and used to access the summary data. For example, the summary data we generated using Hive earlier in this example could have been written into HBase to be accessed quickly to provide statistical data on the fly for a website.

Using HBase data from Hive Now that we have the data in HBase, we can start querying and reporting on the information. The primary advantage of HBase is its powerful querying facilities based on the MapReduce within Hadoop. Since the data is stored internally as simple key-value combinations, it is easy to process through MapReduce. MapReduce is no solution for someone from the SQL world, but we can take advantage of the flexible nature of Hive's processing model to crunch HBase data using the HQL interface. You may remember earlier I described how Hive supports processing of mapped data types; this is what HBase data is: mapped key-value pairs. To use HBase data, we need to create a table within Hive that points to the HBase table and maps the key-value pairs in HBase to the column style of Hive. Listing 17. Creating a table within Hive that points to the HBase table and maps the key-value pairs in HBase to the column style of Hive. The first part of this code creates a table definition identical to the one we used natively in Hive, except that we have added the row-key UUID as the first column. The STORED BY block defines the storage format. The SERDEPROPERTIES block is the mapping between the document structure and the columns; the colon separates the key name and corresponding value, and defines how the data should be mapped to the columns, in sequence, from the table definition. The TBLPROPERTIES block defines the name of the HBase table where the data lies. Once the table has been created, it can be queried through Hive using native SQL, just as we saw earlier. Why use this method instead of a native import? The primary reason is the ease with which it can be queried (although no longer live), but also because the underlying HBase data can be updated, rather than just appended to. In an SQL-to-Hadoop architecture, this advantage means we can take regular dumps of changing data from SQL and update the content.

Reminder: HBase or Hive Given the information here, it's worth reminding ourselves of the benefits of the two systems. Table 1. Benefits of the two systems.
Which one you use will depend entirely on your use case, the data you have available, and how you want to query it. Hive is great for massive processing of ever-increasing data. HBase is useful for querying data that may change over time and need to be updated.

Conclusion The primary reason for moving data between SQL stores and Hadoop is usually to take advantage of the massive storage and processing capabilities to process quantities of data larger than you could hope to cope with in SQL alone. How you exchange and process that information from your SQL store into Hadoop is, therefore, important. Large quantities of long-term data that need to be queried more interactively can take advantage of the append-only and SQL nature of Hive. For data that needs to be updated and processed, it might make more sense to use HBase. HBase also makes an ideal output target from Hive because it's so easy to access summary data directly by using the native key-value store. When processing, you also need to consider how to get the data back in. With Hive, the process is easy because we can run SQL and get a table that can easily be imported back to our SQL store for straightforward or live query processing. In this article, I've covered a wide range of use cases and examples of how data can be exchanged more easily from SQL using tabular interfaces to the Hadoop and non-tabular storage underneath.

Downloadable resources Related topics Check out the Chicago Traffic Tracker. Read What's the big deal about Big SQL to learn about IBM's SQL interface to its Hadoop-based platform, InfoSphere BigInsights. Big SQL is designed to provide SQL developers with an easy on-ramp for querying data managed by Hadoop. Dig deeper in the Developing Big SQL queries to analyze big data tutorial in the InfoSphere BigInsights tutorial collection (PDF). Analyzing social media and structured data with InfoSphere BigInsights teaches the basics of using BigSheets to analyze social media and structured data collected through sample applications provided with BigInsights. Read Understanding InfoSphere BigInsights to learn more about the InfoSphere BigInsights architecture and underlying technologies. Watch the Big Data: Frequently Asked Questions for IBM InfoSphere BigInsights video to listen to Cindy Saracco discuss some of the frequently asked questions about IBM's Big Data platform and InfoSphere BigInsights. Watch Cindy Saracco demonstrate portions of the scenario described in this article in Big Data - Analyzing Social Media for Watson. Read Exploring your InfoSphere BigInsights cluster and sample applications to learn more about the InfoSphere BigInsights web console. Learn about the IBM Watson research project. Check out Big Data University for free courses on Hadoop and big data. Order a copy of Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data for details on two of IBM's key big data technologies. Learn more about Apache Hadoop. Check out HadoopDB. Read Using MapReduce and load balancing on the cloud to learn how to implement the Hadoop MapReduce framework in a cloud environment and how to use virtual load balancing to improve the performance of both a single- and multiple-node system. For information on installing Hadoop using CDH4, see CDH4 Installation - Cloudera Support. Big Data Glossary, by Pete Warden, O'Reilly Media, ISBN 1449314597, 2011, and Hadoop: The Definitive Guide, by Tom White, O'Reilly Media, ISBN 1449389732, 2010, offer more information.
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads explores the feasibility of building a hybrid system that takes the best features from both technologies. Learn more by reading MapReduce and parallel DBMSes: friends or foes? A Survey of Large Scale Data Management Approaches in Cloud Environments gives a comprehensive survey of numerous approaches and mechanisms of deploying data-intensive applications in the cloud, which are gaining a lot of momentum in both research and industrial communities. Get Hadoop 0.20.1, Hadoop MapReduce, and Hadoop HDFS from Apache.org. Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based offering that extends the value of open source Hadoop with features like Big SQL, text analytics, and BigSheets. Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image. Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights. Find resources to help you get started with InfoSphere Streams, IBM's high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources. Download InfoSphere Streams, available as a native software installation or as a VMware image. Use InfoSphere Streams on IBM SmartCloud Enterprise.

Hadoop Tutorial - Getting Started with HDP

Introduction Hello World is often used by developers to familiarize themselves with new concepts by building a simple program. This tutorial aims to achieve a similar purpose by getting practitioners started with Hadoop and HDP. We will use an Internet of Things (IoT) use case to build your first HDP application. This tutorial describes how to refine data for a Trucking IoT Data Discovery (aka IoT Discovery) use case using the Hortonworks Data Platform. The IoT Discovery use case involves vehicles, devices and people moving across a map or similar surface. Your analysis is targeted to linking location information with your analytic data. For our tutorial we are looking at a use case where we have a truck fleet. Each truck has been equipped to log location and event data. These events are streamed back to a datacenter where we will be processing the data. The company wants to use this data to better understand risk. Here is the video of Analyzing Geolocation Data to show you what you'll be doing in this tutorial.

Pre-Requisites: Downloaded and installed Hortonworks Sandbox. Before entering the Hello HDP labs, we highly recommend you go through Learning the Ropes of the Hortonworks Sandbox to become familiar with the Sandbox in a VM and the Ambari interface. Data set used: Geolocation.zip. Optional: Hortonworks ODBC driver installed and configured; see the tutorial on installing the ODBC driver for Windows or OS X. Refer to Installing and Configuring the Hortonworks ODBC driver on Windows 7, or Installing and Configuring the Hortonworks ODBC driver on Mac OS X. In this tutorial, the Hortonworks Sandbox is installed on an Oracle VirtualBox virtual machine (VM); your screens may be different.
Tutorial Overview In this tutorial, we will provide the collected geolocation and truck data. We will import this data into HDFS and build derived tables in Hive. Then we will process the data using Pig, Hive and Spark. The processed data is then visualized using Apache Zeppelin. To refine and analyze the geolocation data, we will: Review some Hadoop fundamentals. Download and extract the geolocation data files. Load the captured data into the Hortonworks Sandbox. Run Hive, Pig and Spark scripts that compute truck mileage and driver risk factor. Visualize the geolocation data using Zeppelin.

Goals of the Tutorial The goal of this tutorial is that you get familiar with the basics of the following: Hadoop and HDP; Ambari File User Views and HDFS; Ambari Hive User Views and Apache Hive; Ambari Pig User Views and Apache Pig; Apache Spark; and data visualization with Zeppelin (optional).

Tutorial Q&A and Reporting Issues If you need help or have questions with this tutorial, please first check HCC for existing answers to questions on this tutorial using the Find Answers button. If you don't find your answer, you can post a new HCC question for this tutorial using the Ask Questions button. Tutorial Name: Hello HDP An Introduction to Hadoop with Hive and Pig. If the tutorial has multiple labs, please indicate which lab your question corresponds to, and please provide any feedback related to that lab. All Hortonworks, partner and community tutorials are posted in the Hortonworks github and can be contributed via the Hortonworks Tutorial Contribution Guide. If you are certain there is an issue or bug with the tutorial, please create an issue on the repository and we will do our best to resolve it.

In this tutorial, we will explore important concepts that will strengthen your foundation in the Hortonworks Data Platform (HDP). Apache Hadoop is a layered structure to process and store massive amounts of data. In our case, Apache™ Hadoop will be recognized as an enterprise solution in the form of HDP. At the base of HDP exists our data storage environment known as the Hadoop Distributed File System. When data files are accessed by Hive, Pig or another coding language, YARN is the Data Operating System that enables them to analyze, manipulate or process that data. HDP includes various components that open new opportunities and efficiencies in healthcare, finance, insurance and other industries that impact people.

Pre-Requisites

1st Concept: Hadoop & HDP 1.1 Introduction In this module you will learn about Apache™ Hadoop and what makes it scale to large data sets. We will also talk about various components of the Hadoop ecosystem that make Apache Hadoop enterprise ready in the form of the Hortonworks Data Platform (HDP) distribution. This module discusses Apache Hadoop and its capabilities as a data platform. The core of Hadoop and its surrounding ecosystem solution vendors provide the enterprise requirements to integrate alongside Data Warehouses and other enterprise data systems. These are steps towards the implementation of a modern data architecture, and towards delivering an enterprise 'Data Lake'.

1.2 Goals of this module Understanding Hadoop. Understanding the five pillars of HDP. Understanding HDP components and their purpose.

1.3 Apache Hadoop Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data.
Numerous Apache Software Foundation projects make up the services required by an enterprise to deploy, integrate and work with Hadoop. Refer to the blog reference below for more information on Hadoop. The base Apache Hadoop framework is composed of the following modules: Hadoop Common contains libraries and utilities needed by other Hadoop modules. Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. Hadoop YARN is a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications. Hadoop MapReduce is a programming model for large scale data processing. Each project has been developed to deliver an explicit function and each has its own community of developers and individual release cycles. There are five pillars to Hadoop that make it enterprise ready:

Data Management Store and process vast quantities of data in a storage layer that scales linearly. Hadoop Distributed File System (HDFS) is the core technology for the efficient scale out storage layer, and is designed to run across low-cost commodity hardware. Apache Hadoop YARN is the pre-requisite for Enterprise Hadoop as it provides the resource management and pluggable architecture for enabling a wide variety of data access methods to operate on data stored in Hadoop with predictable performance and service levels. Apache Hadoop YARN Part of the core Hadoop project, YARN is a next-generation framework for Hadoop data processing extending MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models. HDFS Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.

Data Access Interact with your data in a wide variety of ways from batch to real-time. Apache Hive is the most widely adopted data access technology, though there are many specialized engines. For instance, Apache Pig provides scripting capabilities, Apache Storm offers real-time processing, Apache HBase offers columnar NoSQL storage and Apache Accumulo offers cell-level access control. All of these engines can work across one set of data and resources thanks to YARN and intermediate engines such as Apache Tez for interactive access and Apache Slider for long-running applications. YARN also provides flexibility for new and emerging data access methods, such as Apache Solr for search and programming frameworks such as Cascading. Apache Hive Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS. Apache Pig A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs paired with the MapReduce framework for processing these programs. MapReduce MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner. Apache Spark Spark is ideal for in-memory data processing. It allows data scientists to implement fast, iterative algorithms for advanced analytics such as clustering and classification of datasets.
Apache Storm: a distributed real-time computation system for processing fast, large streams of data, adding reliable real-time data processing capabilities to Apache Hadoop 2.x.
Apache HBase: a column-oriented NoSQL data storage system that provides random, real-time read/write access to big data for user applications.
Apache Tez: Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
Apache Kafka: Kafka is a fast and scalable publish-subscribe messaging system that is often used in place of traditional message brokers because of its higher throughput, replication and fault tolerance.
Apache HCatalog: a table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
Apache Slider: a framework for deployment of long-running data access applications in Hadoop. Slider leverages YARN's resource management capabilities to deploy those applications, to manage their lifecycles and to scale them up or down.
Apache Solr: Solr is the open source platform for searches of data stored in Hadoop. Solr enables powerful full-text search and near real-time indexing on many of the world's largest Internet sites.
Apache Mahout: Mahout provides scalable machine learning algorithms for Hadoop which aid data science tasks such as clustering, classification and batch-based collaborative filtering.
Apache Accumulo: Accumulo is a high performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google's Bigtable design that works on top of Apache Hadoop and Apache ZooKeeper.

Data Governance and Integration: quickly and easily load data, and manage it according to policy. Apache Falcon provides policy-based workflows for data governance, while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS.
Apache Falcon: Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop. It enables users to orchestrate data motion, pipeline processing, disaster recovery and data retention workflows.
Apache Flume: Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
Apache Sqoop: Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable parallel load for various popular enterprise data sources.

Security: address requirements of authentication, authorization, accounting and data protection. Security is provided at every layer of the Hadoop stack, from HDFS and YARN to Hive and the other data access components, on up through the entire perimeter of the cluster via Apache Knox.
Apache Knox: the Knox Gateway ("Knox") provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access to the cluster.
Apache Ranger: Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, accounting and data protection.

Operations: provision, manage, monitor and operate Hadoop clusters at scale.
Apache Ambari: an open source installation lifecycle management, administration and monitoring system for Apache Hadoop clusters.
Apache Oozie: Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
Apache ZooKeeper: a highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.

Apache Hadoop can be useful across a range of use cases spanning virtually every vertical industry. It is becoming popular anywhere that you need to store, process and analyze large volumes of data. Examples include digital marketing automation, fraud detection and prevention, social network and relationship analysis, predictive modeling for new drugs, retail in-store behavior analysis, and mobile device location-based marketing. To learn more about Apache Hadoop, watch the following introduction:

1.4 Hortonworks Data Platform (HDP)
Hortonworks Data Platform (HDP) is a packaged software Hadoop distribution that aims to ease deployment and management of Hadoop clusters. Compared with simply downloading the various Apache code bases and trying to run them together as a system, HDP greatly simplifies the use of Hadoop. Architected, developed and built completely in the open, HDP provides an enterprise-ready data platform that enables organizations to adopt a modern data architecture. With YARN as its architectural center, it provides a data platform for multi-workload data processing across an array of processing methods, from batch through interactive to real-time, supported by the key capabilities required of an enterprise data platform spanning governance, security and operations.

The Hortonworks Sandbox is a single-node implementation of HDP. It is packaged as a virtual machine to make evaluation and experimentation with HDP fast and easy. The tutorials and features in the Sandbox are oriented towards exploring how HDP can help you solve your business big data problems. The Sandbox tutorials will walk you through how to bring some sample data into HDP and how to manipulate it using the tools built into HDP. The idea is to show you how you can get started and how to accomplish tasks in HDP. HDP is free to download and use in your enterprise, and you can download it here: Hortonworks Data Platform.
1.5 Suggested Readings

2nd Concept: HDFS
2.1 Introduction
A single physical machine becomes saturated with its storage capacity as data grows. With this growth comes the impending need to partition your data across separate machines. A file system that manages the storage of data across a network of machines is called a distributed file system. HDFS is a core component of Apache Hadoop and is designed to store large files with streaming data access patterns, running on clusters of commodity hardware. With Hortonworks Data Platform (HDP) 2.2, HDFS was expanded to support heterogeneous storage media within the HDFS cluster.
2.2 Goals of this module
Understanding HDFS architecture. Understanding the Hortonworks Sandbox Ambari Files User View.
2.3 Hadoop Distributed File System
HDFS is a distributed file system that is designed for storing large data files. HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers.
HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will "just work" under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.

An HDFS cluster is comprised of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, and namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file system of the DataNodes. The NameNode actively monitors the number of replicas of each block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM. The NameNode does not directly send requests to DataNodes. It sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes. The instructions include commands to: replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, or shut down the node. With the next-generation HDFS data architecture that comes with HDP 2.4, HDFS has evolved to provide automated failover with a hot standby, with full stack resiliency. The video provides more clarity on HDFS.

2.3.1 Ambari Files User View on Hortonworks Sandbox
The Ambari Files User View provides a user-friendly interface to upload, store and move data. Underlying all components in Hadoop is the Hadoop Distributed File System (HDFS). This is the foundation of the Hadoop cluster. The HDFS file system manages how the datasets are stored in the Hadoop cluster. It is responsible for distributing the data across the DataNodes, managing replication for redundancy, and administrative tasks like adding, removing and recovering DataNodes.
2.4 Suggested Readings
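The tutorial itself drives these operations through the Ambari Files User View, but the same HDFS concepts can be seen from the Sandbox command line. Below is a minimal, hedged sketch; the directory and file names are illustrative assumptions rather than required steps of this tutorial:

    # create a working directory in HDFS (illustrative path)
    hdfs dfs -mkdir -p /user/maria_dev/data

    # copy a local CSV file from the Sandbox into HDFS
    hdfs dfs -put trucks.csv /user/maria_dev/data/

    # list the directory and preview the first lines of the file
    hdfs dfs -ls /user/maria_dev/data
    hdfs dfs -cat /user/maria_dev/data/trucks.csv | head -n 5

    # report the blocks and replicas backing the file (NameNode metadata)
    hdfs fsck /user/maria_dev/data/trucks.csv -files -blocks

Behind the scenes, the NameNode records which blocks make up the file and where their replicas live, while the DataNodes hold the actual block data, exactly as described above.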
3rd Concept: MapReduce & YARN
3.1 Introduction
Cluster computing faces several challenges, such as how to store data persistently and keep it available if nodes fail, and how to deal with node failures during a long-running computation. There is also a network bottleneck that delays the time it takes to process data. MapReduce offers a solution by bringing computation close to the data, thereby minimizing data movement. It is a simple programming model designed to process large volumes of data in parallel by dividing the job into a set of independent tasks. A major limitation of MapReduce programming is that map and reduce jobs do not share state, so reduce jobs have to wait for the map jobs to complete first. This limits the maximum parallelism, and so YARN was born as a generic resource management and distributed application framework.
3.2 Goals of the Module
Understanding Map and Reduce jobs. Understanding YARN.

MapReduce is the key algorithm that the Hadoop data processing engine uses to distribute work around a cluster. A MapReduce job splits a large data set into independent chunks and organizes them into key/value pairs for parallel processing. This parallel processing improves the speed and reliability of the cluster, returning solutions more quickly and with greater reliability. The Map function divides the input into ranges by the InputFormat and creates a map task for each range in the input. The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each reduce:

map(key1, value1) -> list<key2, value2>

The Reduce function then collects the various results and combines them to answer the larger problem that the master node needs to solve. Each reduce pulls the relevant partition from the machines where the maps executed, then writes its output back into HDFS. Thus, the reduce is able to collect the data from all of the maps for its keys and combine them to solve the problem:

reduce(key2, list<value2>) -> list<value3>

The current Apache Hadoop MapReduce system is composed of the JobTracker, which is the master, and the per-node slaves called TaskTrackers. The JobTracker is responsible for resource management (managing the worker nodes, i.e. TaskTrackers), tracking resource consumption/availability, and also job life-cycle management (scheduling individual tasks of the job, tracking progress, providing fault tolerance for tasks, etc.). The TaskTracker has simple responsibilities: launch/teardown tasks on orders from the JobTracker and provide task-status information to the JobTracker periodically.

The Apache Hadoop projects provide a series of tools designed to solve big data problems. The Hadoop cluster implements a parallel computing cluster using inexpensive commodity hardware. The cluster is partitioned across many servers to provide near-linear scalability. The philosophy of the cluster design is to bring the computing to the data, so each DataNode holds part of the overall data and is able to process the data that it holds. The overall framework for the processing software is called MapReduce. Here's a short video introduction to MapReduce:
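To make the map() and reduce() signatures above concrete, here is the classic word-count example written against the Hadoop MapReduce Java API. It is not part of this tutorial's labs; treat it as an illustrative sketch of how a mapper emits (key, value) pairs and a reducer aggregates them.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // map(key1, value1) -> list<key2, value2>: emit (word, 1) for every word in a line
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // reduce(key2, list<value2>) -> list<value3>: sum the counts for each word
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

A job like this would be packaged as a jar and submitted with hadoop jar, pointing at an HDFS input and output directory; in Hadoop 2.x the scheduling is handled by YARN rather than the JobTracker/TaskTracker pair described above.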
3.4 Apache YARN (Yet Another Resource Negotiator)
HDFS is the data storage layer for Hadoop, and MapReduce was the data-processing layer in Hadoop 1.x. However, the MapReduce algorithm, by itself, isn't sufficient for the very wide variety of use cases we see Hadoop being employed to solve. Hadoop 2.0 introduces YARN as a generic resource-management and distributed application framework, whereby one can implement multiple data processing applications customized for the task at hand. The fundamental idea of YARN is to split the two major responsibilities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM). The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.

The ResourceManager has a pluggable Scheduler, which is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues and so on. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application, and offers no guarantees on restarting failed tasks, whether due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a resource container, which incorporates resource elements such as memory, CPU, disk and network. The NodeManager is the per-machine slave, which is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress. From the system perspective, the ApplicationMaster itself runs as a normal container. Here is an architectural view of YARN:

One of the crucial implementation details for MapReduce within the new YARN system worth pointing out is that the existing MapReduce framework was reused without any major surgery. This was very important to ensure compatibility for existing MapReduce applications and users. Here is a short video introduction for YARN.
3.5 Suggested Readings
HDFS is one of the 4 components of Apache Hadoop; the other 3 are Hadoop Common, Hadoop YARN and Hadoop MapReduce. To learn more about HDFS, watch the following HDFS introduction video. To learn more about YARN, watch the following YARN introduction video.

4th Concept: Hive and Pig
4.1 Introduction: Apache Hive
Hive provides an SQL-like query language that enables analysts familiar with SQL to run queries on large volumes of data. Hive has three main functions: data summarization, query and analysis. Hive provides tools that enable easy data extraction, transformation and loading (ETL).
4.2 Goals of the module
Understanding Apache Hive. Understanding Apache Tez. Understanding the Ambari Hive User Views on the Hortonworks Sandbox.

Data analysts use Hive to explore, structure and analyze that data, then turn it into business insights. Hive implements a dialect of SQL (HiveQL) that focuses on analytics and presents a rich set of SQL semantics, including OLAP functions, sub-queries, common table expressions and more. Hive allows SQL developers or users with SQL tools to easily query, analyze and process data stored in Hadoop. Hive also allows programmers familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. Hive users have a choice of 3 runtimes when executing SQL queries: they can choose between the Apache Hadoop MapReduce, Apache Tez or Apache Spark frameworks as their execution backend. Here is an advantageous characteristic of Hive for enterprise SQL in Hadoop: as data variety and volume grow, more commodity machines can be added without a corresponding reduction in performance.
4.3.1 How Hive Works
The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are comprised of tables, which are made up of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or appending data. Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
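As a hedged illustration of the runtime choice and the partition/bucket taxonomy just described, the statements below show how a Hive user could pick an execution engine for the session and declare a partitioned, bucketed ORC table. The table and column names are invented for illustration and do not come from this tutorial's data sets.

    -- choose the execution backend for this session (mr, tez or spark)
    SET hive.execution.engine=tez;

    -- illustrative table: one partition per event date, bucketed by driverid
    CREATE TABLE IF NOT EXISTS driver_events (
      driverid STRING,
      event    STRING,
      velocity DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (driverid) INTO 8 BUCKETS
    STORED AS ORC;

    -- each partition becomes a sub-directory of the table's HDFS directory
    SHOW PARTITIONS driver_events;

Partitioning prunes whole sub-directories at query time, while bucketing spreads rows across a fixed number of files within each partition, which helps with sampling and joins.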
4.3.2 Components of Hive
HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools, including Pig and MapReduce, to more easily read and write data on the grid. HCatalog holds a set of file paths and metadata about data in a Hadoop cluster. This allows scripts, MapReduce and Tez jobs to be decoupled from data location and metadata such as the schema. Additionally, since HCatalog also supports tools like Hive and Pig, the location and metadata can be shared between tools. Using the open APIs of HCatalog, external tools that want to integrate, such as Teradata Aster, can also leverage file path location and metadata in HCatalog. At one point HCatalog was its own Apache project; however, in March 2013, the HCatalog project merged with Hive. HCatalog is currently released as part of Hive. WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig and Hive jobs, or perform Hive metadata operations, using an HTTP (REST style) interface. Here is a short video introduction on Hive:

Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce's ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem. Apache Tez provides a developer API and framework to write native YARN applications that bridge the spectrum of interactive and batch workloads. It allows those data access applications to work with petabytes of data over thousands of nodes. The Apache Tez component library allows developers to create Hadoop applications that integrate natively with Apache Hadoop YARN and perform well within mixed workload clusters. Since Tez is extensible and embeddable, it provides the fit-to-purpose freedom to express highly optimized data processing applications, giving them an advantage over end-user-facing engines such as MapReduce and Apache Spark. Tez also offers a customizable execution architecture that allows users to express complex computations as dataflow graphs, permitting dynamic performance optimizations based on real information about the data and the resources required to process it. Here is a short video introduction on Tez.

4.3.4 Stinger and Stinger.next
The Stinger Initiative was started to enable Hive to support an even broader range of use cases at truly Big Data scale: bringing it beyond its batch roots to support interactive queries, all with a common SQL access layer. Stinger.next is a continuation of this initiative, focused on further enhancing the speed, scale and breadth of SQL support to enable truly real-time access in Hive while also bringing support for transactional capabilities.
And just as the original Stinger initiative did, this will be addressed through a familiar three-phase delivery schedule and developed completely in the open Apache Hive community.

4.3.5 Ambari Hive User Views on Hortonworks Sandbox
To make it easy to interact with Hive, we use a tool in the Hortonworks Sandbox called the Ambari Hive User View. The Ambari Hive User View provides an interactive interface to Hive. We can create, edit, save and run queries, and have Hive evaluate them for us using a series of MapReduce jobs or Tez jobs. Let's now open the Ambari Hive User View and get introduced to the environment. Go to the Ambari User View icon and select Hive: Ambari Hive User View. Now let's take a closer look at the SQL editing capabilities in the User View.

There are five tabs to interact with SQL:
Query: this is the interface shown above and the primary interface to write, edit and execute new SQL statements.
Saved Queries: you can save your favorite queries and quickly access them to rerun or edit.
History: this allows you to look at past queries or currently running queries to view, edit and rerun. It also allows you to see all SQL queries you have authority to view. For example, if you are an operator and an analyst needs help with a query, the Hadoop operator can use the History feature to see the query that was sent from the reporting tool.
UDFs: allows you to define UDF interfaces and associated classes so you can access them from the SQL editor.
Upload Table: allows you to upload data files into Hive tables in your preferred database; the new table appears instantly in the Query Editor for execution.

Database Explorer: the Database Explorer helps you navigate your database objects. You can either search for a database object in the Search tables dialog box, or navigate through Database > Table > Columns in the navigation pane.

The principal pane is where you write and edit SQL statements. This editor includes content assist via Ctrl+Space to help you build queries. Content assist helps you with SQL syntax and table objects. Once you have created your SQL statement you have 4 options:
Execute: runs the SQL statement.
Explain: provides a visual plan, from the Hive optimizer, of how the SQL statement will be executed.
Save as: allows you to persist your queries into your list of saved queries.
Kill Session: terminates the SQL statement.

When the query is executed you can see the Logs or the actual query results. Logs: when the query is executed you can see the logs associated with the query execution. If your query fails, this is a good place to get additional information for troubleshooting. Results: you can view results in sets of 50 by default.

There are six sliding views on the right hand side with the following capabilities, which are in the context of the tab you are in:
Query: the default operation, which allows you to write and edit SQL.
Settings: allows you to set properties globally or associated with an individual query.
Data Visualization: allows you to visualize your numeric data through different charts.
Visual Explain: generates an explain plan for the query and also shows the progress of the query.
TEZ: if you use Tez as the query execution engine, you can view the DAG associated with the query. This integrates the Tez User View so you can check for correctness and it helps with performance tuning by visualizing the Tez jobs associated with a SQL query.
Notifications: this is how you get feedback on query execution.
The Apache Hive project provides a data warehouse view of the data in HDFS. Using a SQL dialect, HiveQL (HQL), Hive lets you create summarizations of your data and perform ad-hoc queries and analysis of large datasets in the Hadoop cluster. The overall approach with Hive is to project a table structure on the dataset and then manipulate it with SQL. The notion of projecting a table structure on a file is often referred to as Schema-On-Read. Since you are using data in HDFS, your operations can be scaled across all the DataNodes and you can manipulate huge datasets.

4.4 Introduction: Apache Pig
MapReduce allows you to specify map and reduce functions, but working out how to fit your data processing into this pattern may sometimes require you to write multiple MapReduce stages. With Pig, data structures are much richer and the transformations you can apply to data are much more powerful.
4.4.1 Goals of this Module
Understanding Apache Pig. Understanding Apache Pig on Tez. Understanding the Ambari Pig User Views on the Hortonworks Sandbox.

Apache Pig allows Apache Hadoop users to write complex MapReduce transformations using a simple scripting language called Pig Latin. Pig translates the Pig Latin script into MapReduce so that it can be executed within YARN for access to a single dataset stored in the Hadoop Distributed File System (HDFS). Pig was designed for performing a long series of data operations, making it ideal for three categories of Big Data jobs. Whatever the use case, Pig will be:

Lab 2 - Hive and Data ETL
Introduction
In this tutorial, you will be introduced to Apache Hive. In the earlier section, we covered how to load data into HDFS, so now you have the geolocation and trucks files stored in HDFS as CSV files. In order to use this data in Hive, we will guide you on how to create a table and how to move data into a Hive warehouse, from where it can be queried. We will analyze this data using SQL queries in the Hive User Views and store it as ORC. We will also walk through Apache Tez and how a DAG is created when you specify Tez as the execution engine for Hive. Let's start.
Pre-Requisites
This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before proceeding with this tutorial.

Apache Hive
Apache Hive provides an SQL interface to query data stored in various databases and file systems that integrate with Hadoop. Hive enables analysts familiar with SQL to run queries on large volumes of data. Hive has three main functions: data summarization, query and analysis. Hive provides tools that enable easy data extraction, transformation and loading (ETL).

Step 2.1: Become Familiar with the Ambari Hive View
Apache Hive presents a relational view of data in HDFS. Hive can present data in a tabular format whether it is managed by Hive or simply stored in HDFS, irrespective of the file format the data is stored in. Hive can query data from RCFile format, text files, ORC, JSON, Parquet, sequence files and many other formats in a tabular view. Through the use of SQL you can view your data as a table and create queries like you would in an RDBMS. To make it easy to interact with Hive, we use a tool in the Hortonworks Sandbox called the Ambari Hive View. The Ambari Hive View provides an interactive interface to Hive. We can create, edit, save and run queries, and have Hive evaluate them for us using a series of MapReduce jobs or Tez jobs. Let's now open the Ambari Hive View and get introduced to the environment.
Go to the 9-square Ambari User Views icon and select Hive View. The Ambari Hive View looks like the following. Now let's take a closer look at the SQL editing capabilities in the Hive View.

There are five tabs to interact with SQL:
Query: this is the interface shown above and the primary interface to write, edit and execute new SQL statements.
Saved Queries: you can save your favorite queries and quickly access them to rerun or edit.
History: this allows you to look at past queries or currently running queries to view, edit and rerun. It also allows you to see all SQL queries you have authority to view. For example, if you are an operator and an analyst needs help with a query, the Hadoop operator can use the History feature to see the query that was sent from the reporting tool.
UDFs: allows you to define UDF interfaces and associated classes so you can access them from the SQL editor.
Upload Table: allows you to upload data files into Hive tables in your preferred database; the new table appears instantly in the Query Editor for execution.

Database Explorer: the Database Explorer helps you navigate your database objects. You can either search for a database object in the Search tables dialog box, or navigate through Database > Table > Columns in the navigation pane.

Query Editor: the principal pane to write and edit SQL statements. This editor includes content assist via Ctrl+Space to help you build queries. Content assist helps you with SQL syntax and table objects. Once you have created your SQL statement you have 4 options:
Execute: runs the SQL statement.
Explain: provides a visual plan, from the Hive optimizer, of how the SQL statement will be executed.
Save as: allows you to persist your queries into your list of saved queries.
Kill Session: terminates the SQL statement.

When the query is executed you can see the Logs or the actual query results. Logs tab: when the query is executed you can see the logs associated with the query execution. If your query fails, this is a good place to get additional information for troubleshooting. Results tab: you can view results in sets of 50 by default.

There are six sliding views on the right hand side with the following capabilities, which are in the context of the tab you are in:
Query: the default operation, which allows you to write and edit SQL.
Settings: allows you to set properties globally or associated with an individual query.
Data Visualization: allows you to visualize your numeric data through different charts.
Visual Explain: generates an explain plan for the query and also shows the progress of the query.
TEZ: if you use Tez as the query execution engine, you can view the DAG associated with the query. This integrates the Tez User View so you can check for correctness and it helps with performance tuning by visualizing the Tez jobs associated with a SQL query.
Notifications: this is how you get feedback on query execution.

Take a few minutes to explore the various Hive View features.

2.1.1 Set hive.execution.engine to Tez
A setting we will configure before we run our Hive queries is the Hive execution engine, which we set to Tez. You can try MapReduce if you like; we will use Tez in this tutorial. 1. Click on the gear in the sidebar, referred to as number 6 in the interface above. 2. Click on the dropdown menu, choose hive.execution.engine and set the value to tez. Now we are ready to run our queries for this tutorial.
Step 2.2: Define a Hive Table
Now that you are familiar with the Hive View, let's create and load tables for the geolocation and trucks data. In this section we will learn how to use the Ambari Hive View to create two tables, geolocation and trucks, using the Hive View Upload Table tab. The Upload Table tab provides the following key options: choose the input file type, storage options (i.e. Apache ORC) and set the first row as header. Here is a visual representation of the table creation and load process accomplished in the next few steps:

2.2.1 Create and Load the Trucks Table for Staging the Initial Load
Navigate to and select the Upload Table tab of the Ambari Hive View. Then select the Upload from HDFS radio button, enter the HDFS path /user/maria_dev/data/trucks.csv and click the Preview button. You should see a similar dialog. Note that the first row contains the names of the columns. Fortunately the Upload Table tab has a feature to specify the first row as a header for the column names. Press the gear button next to the File type pull-down menu, shown above, to open the file type customization window. Then check the "Is first row header" checkbox and hit the Close button. You should now see a similar dialog box with the names of the header columns as the names of the columns. Once you have finished setting all the various properties, select the Upload Table button to start the create-and-load process. Before reviewing what is happening behind the covers in the Upload Progress, let's learn more about Hive file formats.

2.2.2: Define an ORC Table in Hive
Create a table using the Apache ORC file format. Apache ORC is a fast columnar storage file format for Hadoop workloads. The Optimized Row Columnar (now the Apache ORC project) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing and processing data. To use the ORC format, specify ORC as the file format when creating the table (a hedged example appears after the note below). Similar CREATE statements are used for the temporary tables created by the Upload Table tab.

2.2.3: Review the Upload Table Progress Steps
Initially the trucks.csv data is loaded into a temporary table. The temporary table is used to create and load the data in ORC format, using syntax explained in the previous step. Once the data is loaded into the final table, the temporary tables are deleted. NOTE: the temporary table names are a random set of characters and not the names in the illustration above. You can review the SQL statements issued by selecting the History tab and clicking on the 4 internal jobs that were executed as a result of using the Upload Table tab.

2.2.4 Create and Load the Geolocation Table
Repeat the steps above with the geolocation.csv file to create and load the geolocation table using the ORC file format.

2.2.5 Hive Create Table Statement
Let's review some aspects of the CREATE TABLE statements generated and issued above. If you have an SQL background, this statement should seem very familiar except for the last 3 lines after the column definitions: the ROW FORMAT clause specifies that each row is terminated by the new line character; the FIELDS TERMINATED BY clause specifies that the fields associated with the table (in our case, the two CSV files) are delimited by a comma; and the STORED AS clause specifies that the table will be stored in the TEXTFILE format. NOTE: For details on these clauses, consult the Apache Hive Language Manual.
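The generated statements are only shown as screenshots in the original tutorial, so here is a hedged reconstruction of the two kinds of CREATE TABLE statements the Upload Table tab produces: a temporary text staging table with the ROW FORMAT / FIELDS TERMINATED BY / STORED AS TEXTFILE clauses discussed above, and the final ORC table. The column list is abbreviated for illustration; the real generated tables carry the full CSV header.

    -- staging table: plain text, comma-delimited, one line per record
    CREATE TABLE trucks_stage (
      driverid STRING,
      truckid  STRING,
      model    STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    -- final table: same columns, stored in the ORC columnar format
    CREATE TABLE trucks STORED AS ORC
    AS SELECT * FROM trucks_stage;

Keeping a cheap text staging table and rewriting it into ORC is the same pattern the Upload Table tab automates behind the scenes.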
2.2.6 Verify the New Tables Exist
To verify the tables were defined successfully, click the refresh icon in the Database Explorer. Under Databases, click the default database to expand the list of tables, and the new tables should appear.

2.2.7 Sample Data from the Trucks Table
Click on the Load sample data icon to generate and execute a SELECT statement that queries the table for 100 rows. You can have multiple SQL statements within each editor worksheet, but each statement needs to be separated by a semicolon ";". If you have multiple statements within a worksheet but only want to run one of them, just highlight the statement you want to run and then click the Execute button. A few additional commands to explore tables:
show tables – lists the tables created in the database by looking up the list of tables from the metadata stored in HCatalog.
describe – provides the list of columns for a particular table (i.e. describe geolocation_stage).
show create table – provides the DDL to recreate a table (i.e. show create table geolocation_stage).
describe formatted – explores additional metadata about the table. For example, to verify that geolocation is an ORC table, execute the corresponding query, scroll down to the bottom of the Results tab, and you will see a section labeled Storage Information. The output should look like the screenshot.

By default, when you create a table in Hive, a directory with the same name gets created in the /apps/hive/warehouse folder in HDFS. Using the Ambari Files View, navigate to the /apps/hive/warehouse folder; you should see both a geolocation and a trucks directory. NOTE: the definition of a Hive table and its associated metadata (i.e. the directory the data is stored in, the file format, what Hive properties are set, etc.) are stored in the Hive metastore, which on the Sandbox is a MySQL database.

2.2.8 Rename the Query Editor Worksheet
Notice the tab of your new worksheet is labeled "trucks sample data". Double-click on the worksheet tab to rename the label to "sample truck data". Now save this worksheet by clicking the button.

2.2.9 Command Line Approach: Populate a Hive Table with Data
The following Hive command can be used to LOAD data into an existing table from the command line. If you run the command and navigate to the /user/maria_dev/data folder, you will notice the folder is empty: the LOAD DATA INPATH command moved the trucks.csv file from the /user/maria_dev/data folder to the /apps/hive/warehouse/trucks_stage folder.

2.2.10 Beeline – Command Shell
If you want to try running some of these commands from the command line, you can use the Beeline shell. Beeline uses a JDBC connection to connect to HiveServer2. Follow these steps from your shell-in-a-box (or PuTTY if using Windows): i. Local Sandbox VM: open shell-in-a-box to SSH into HDP. iii. beeline: starts the Beeline shell, where you can enter commands and SQL. iv. quit: exits the Beeline shell. What did you notice about performance after running Hive queries from the shell? Queries using the shell run faster because Hive runs the query directly in Hadoop, whereas in the Ambari Hive View the query must be accepted by a REST server before it can be submitted to Hadoop. You can get more information on Beeline from the Hive wiki. Beeline is based on SQLLine.
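Since the actual statements appear only as screenshots in the original tutorial, here is a hedged sketch of the exploration commands, the LOAD DATA INPATH statement and a Beeline session of the kind described above. The paths and table names follow the conventions used in this lab but should be treated as assumptions.

    -- list tables and inspect their definitions
    SHOW TABLES;
    DESCRIBE geolocation_stage;
    SHOW CREATE TABLE geolocation_stage;
    DESCRIBE FORMATTED geolocation;   -- look for the "Storage Information" section / ORC serde

    -- move a CSV already in HDFS into an existing staging table
    LOAD DATA INPATH '/user/maria_dev/data/trucks.csv' OVERWRITE INTO TABLE trucks_stage;

And, assuming HiveServer2 is listening on the Sandbox's default port 10000, a Beeline session could be opened from the shell with:

    beeline -u jdbc:hive2://localhost:10000 -n maria_dev

Note that LOAD DATA INPATH moves (rather than copies) the file into the table's warehouse directory, which is why the source folder ends up empty.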
Step 2.3: Explore Hive Settings on the Ambari Dashboard
2.3.1 Open the Ambari Dashboard in a New Tab
Click on the Dashboard tab to start exploring the Ambari Dashboard.
2.3.2 Become Familiar with Hive Settings
Go to the Hive page, select the Configs tab, then click on the Settings tab. Once you click on the Hive page you should see a page similar to the one above. Scroll down to the Optimization Settings. In the screenshot we can see that Tez is set as the optimization engine and the Cost Based Optimizer (CBO) is turned on. This shows the HDP 2.5 Ambari Smart Configurations, which simplify setting configurations.

Hadoop is configured by a collection of XML files. In early versions of Hadoop, operators would need to edit XML to change settings, and there was no default versioning. Early Ambari interfaces made it easier to change values by showing the settings page with dialog boxes for the various settings and allowing you to edit them; however, you needed to know what needed to go into the field and understand the range of values. Now, with Smart Configurations, you can toggle binary features and use slider bars for settings that have ranges. By default the key configurations are displayed on the first page. If the setting you are looking for is not on this page, you can find additional settings in the Advanced tab. For example, if we wanted to improve SQL performance, we could use the new Hive vectorization features. These settings can be found and enabled by following these steps: click on the Advanced tab and scroll to find the property, or start typing the property name into the property search field to filter the settings. As you can see from the green circle above, Enable Vectorization and Map Vectorization are turned on already. Some key resources to learn more about vectorization and some of the key settings in Hive tuning:

Step 2.4: Analyze the Trucks Data
Next we will be using Hive, Pig and Zeppelin to analyze data derived from the geolocation and trucks tables. The business objective is to better understand the risk the company is under from driver fatigue, over-used trucks, and the impact of various trucking events on risk. In order to accomplish this, we will apply a series of transformations to the source data, mostly through SQL, and use Pig or Spark to calculate risk. In the last lab, on data visualization, we will be using Zeppelin to generate a series of charts to better understand risk. Let's get started with the first transformation. We want to calculate the miles per gallon for each truck. We will start with our truck data table. We need to sum up all the miles and gas columns on a per-truck basis. Hive has a series of functions that can be used to reformat a table; the keyword LATERAL VIEW is how we invoke them. The stack function allows us to restructure the data into 3 columns labeled rdate, gas and miles (e.g. 'june13', june13_miles, june13_gas) that make up a maximum of 54 rows. We pick truckid, driverid, rdate, miles and gas from our original table and add a calculated column for mpg (miles/gas). Then we will calculate the average mileage.

2.4.1 Create Table truckmileage From Existing Trucking Data
Using the Ambari Hive User View, execute the truckmileage query (a hedged sketch of such a query appears below).
2.4.2 Explore a Sampling of the Data in the truckmileage Table
To view the data generated by the script, click the Load sample data icon in the Database Explorer next to truckmileage. After clicking the next button once, you should see a table that lists each trip made by a truck and driver.
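The query itself appears only as a screenshot in the original tutorial. Below is a hedged reconstruction of what a truckmileage CTAS using LATERAL VIEW and stack() could look like; only three of the month column pairs are written out here (the real query expands all of them to produce up to 54 rows per truck), and the column names are assumptions based on the description above.

    CREATE TABLE truckmileage STORED AS ORC
    AS
    SELECT truckid, driverid, rdate, miles, gas, miles / gas AS mpg
    FROM trucks
    LATERAL VIEW stack(
      3,                                 -- number of (rdate, miles, gas) rows produced per truck
      'june13', june13_miles, june13_gas,
      'may13',  may13_miles,  may13_gas,
      'apr13',  apr13_miles,  apr13_gas
    ) dummyalias AS rdate, miles, gas;

The stack() table-generating function turns the wide per-month columns into one row per month, which is what makes the later per-truck and per-driver aggregations straightforward.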
2.4.3 Use the Content Assist to Build a Query
1. Create a new SQL Worksheet. 2. Start typing the SELECT SQL command, but only enter the first two letters. 3. Press Ctrl+Space to view the content assist pop-up dialog window. NOTE: notice that content assist shows you some options that start with "SE". These shortcuts will be great for when you write a lot of custom query code. 4. Type in the query, using Ctrl+Space throughout your typing so that you can get an idea of what content assist can do and how it works. 5. Click the "Save as" button to save the query as "average mpg". 6. Notice your query now shows up in the list of "Saved Queries", which is one of the tabs at the top of the Hive User View. 7. Execute the "average mpg" query and view its results.

2.4.4 Explore the Explain Features of the Hive Query Editor
1. Now let's explore the various explain features to better understand the execution of a query: Text Explain, Visual Explain and Tez Explain. Click on the Explain button. 2. You should receive an image similar to the one below; the output displays the flow of the resulting Tez job. 3. To see the Visual Explain, click on the Visual Explain icon on the right tabs. This is a much more readable summary of the explain plan.

2.4.5 Explore TEZ
1. If you click on the TEZ View from Ambari Views at the top, you can see DAG details associated with the previous Hive and Pig jobs. 2. Select the first DAG, as it represents the last job that was executed. 3. There are seven tabs at the top left; please take a few minutes to explore the various tabs, then click on the Graphical View tab and hover over one of the nodes with your cursor to get more details on the processing in that node. 4. Let's also view the Vertex Swimlane. This feature helps with troubleshooting Tez jobs. As you will see in the image, there is a graph for Map 1 and Reducer 2. These graphs are timelines for when events happened. Hover over a red or blue line to view an event tooltip: a bubble represents an event, and the solid line represents a vertex, i.e. a timeline of events. For Map 1, the tooltip shows that the "vertex started" and "vertex initialize" events occur simultaneously. For Reducer 2, the tooltip shows that the "vertex started" and "initialize" events are one second apart in execution time. When you look at the tasks started and finished (thick line) for Map 1 compared to Reducer 2 in the graph, what do you notice? Map 1 starts and completes before Reducer 2. 5. Go back to the Hive View and save the query by clicking the Save as button.

2.4.6 Create Table avgmileage From Existing truckmileage Data
Note: verify that hive.execution.engine is set to tez. We will persist these results into a table; this is a fairly common pattern in Hive and it is called Create Table As Select (CTAS). Paste the script into a new Worksheet, then click the Execute button (a hedged sketch of the statement appears at the end of this step).
2.4.7 Load Sample Data of avgmileage
To view the data generated by the script, click the Load sample data icon in the Database Explorer next to avgmileage. You see our table is now a list of the average miles per gallon for each truck.

Step 2.5: Define Table Schema
Now we have refined the truck data to get the average mpg for each truck (the avgmileage table). The next task is to compute the risk factor for each driver, which is the total miles driven divided by the number of abnormal events. We can get the event information from the geolocation table. If we look at the truckmileage table, we have the driverid and the number of miles for each trip. To get the total miles for each driver, we can group those records by driverid and then sum the miles.
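Here is a hedged sketch of the two CTAS statements these steps describe: the avgmileage table from section 2.4.6 above and the drivermileage table that section 2.5.1 below walks through. Both are illustrative reconstructions, since the original statements are shown only as screenshots, and the column aliases are assumptions.

    -- average miles per gallon per truck (section 2.4.6)
    CREATE TABLE avgmileage STORED AS ORC
    AS
    SELECT truckid, AVG(mpg) AS avgmpg
    FROM truckmileage
    GROUP BY truckid;

    -- total miles per driver (section 2.5.1), used later by the Pig and Spark jobs
    CREATE TABLE drivermileage STORED AS ORC
    AS
    SELECT driverid, SUM(miles) AS totmiles
    FROM truckmileage
    GROUP BY driverid;

Both statements follow the CTAS pattern: the SELECT defines the content and the STORED AS ORC clause gives the new table the efficient columnar layout discussed earlier.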
2.5.1 Create Table drivermileage from Existing truckmileage Data
We will start by creating a table named drivermileage from a query of the columns we want from truckmileage. The query groups the records by driverid and sums the miles in the select statement. Execute the drivermileage query (see the hedged sketch above) in a new Worksheet. Note: this table is essential for both the Pig Latin and Spark jobs.
2.5.2 View the Data Generated by the Query
To view the data, click the Load sample data icon in the Database Explorer next to drivermileage. The results should look like the screenshot.
2.5.3 Explore Hive Data Visualization
This tool enables us to transform our Hive data into a visualization that makes the data easier to understand. Let's explore the Hive data explorer to see a variety of different data visualizations. We'll use these examples to build a custom visualization. 1. Issue a query by (1) clicking on the geolocation Load sample data icon and then (2) selecting the Hive View visualization tab. 2. Click on the Data Explorer tab and quickly explore the distribution of the data from the query. 3. You can also explore some custom data visualizations by clicking the tab and then dragging 2 columns into the positional fields. Note that you cannot save these graphs. Explore the following HCC article for more info.

Congratulations! Let's summarize some of the Hive commands we learned to process, filter and manipulate the geolocation and trucks data. We can now create Hive tables with CREATE TABLE and load data into them using the LOAD DATA INPATH command. Additionally, we learned how to change the file format of the tables to ORC, so Hive is more efficient at reading, writing and processing this data. We also learned to select fields from an existing table using SELECT ... FROM to create a new, filtered table.
Suggested Readings
Augment your Hive foundation with the following resources:

Tutorial Q&A and Reporting Issues
If you need help or have questions with this tutorial, please first check HCC for existing answers to questions on this tutorial using the Find Answers button. If you don't find your answer, you can post a new HCC question for this tutorial using the Ask Questions button. Tutorial name: Hello HDP – An Introduction to Hadoop with Hive and Pig. If the tutorial has multiple labs, please indicate which lab your question corresponds to, and provide any feedback related to that lab. All Hortonworks, partner and community tutorials are posted in the Hortonworks GitHub and can be contributed to via the Hortonworks Tutorial Contribution Guide. If you are certain there is an issue or bug with the tutorial, please create an issue on the repository and we will do our best to resolve it.

Lab 3 - Pig Risk Factor Analysis
Introduction
In this tutorial, you will be introduced to Apache Pig. In the earlier section of the lab, you learned how to load data into HDFS and then manipulate it using Hive. We are using the truck sensor data to better understand the risk associated with every driver. This section will teach you to compute risk using Apache Pig.
Pre-Requisites
This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before proceeding with this tutorial: Hortonworks Sandbox; Learning the Ropes of the Hortonworks Sandbox; Lab 1: Load sensor data into HDFS; Lab 2: Data Manipulation with Apache Hive. Allow yourself around one hour to complete this tutorial.

Pig Basics
Pig is a high-level scripting language used with Apache Hadoop.
Pig enables data workers to write complex data transformations without knowing Java. Pig's simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL. Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig. Through the User Defined Functions (UDF) facility in Pig, Pig can invoke code in many languages, such as JRuby, Jython and Java. You can also embed Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems. Pig works with data from many sources, including structured and unstructured data, and stores the results into the Hadoop Distributed File System. Pig scripts are translated into a series of MapReduce jobs that are run on the Apache Hadoop cluster.

Create Table riskfactor from Existing truckmileage Data
Next, you will use Pig to compute the risk factor for each driver. Before we can run the Pig code, the table must already exist in Hive, to satisfy one of the requirements of the HCatStorer() class. The Pig code expects the following structure for a table named riskfactor. Execute the DDL command for it (a hedged sketch of such a statement appears below).
Verify Table riskfactor was Created Successfully
Verify that the riskfactor table was created successfully. It will be empty for now, but you will populate it from a Pig script. You are now ready to compute the risk factor using Pig. Let's take a look at Pig and how to execute Pig scripts from within Ambari.
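The DDL itself is shown only as a screenshot in the original tutorial; a hedged reconstruction of a riskfactor table that HCatStorer() could write into might look like the following (the column names and types are assumptions based on the description of the final Pig output):

    CREATE TABLE riskfactor (
      driverid   STRING,
      events     BIGINT,
      totmiles   BIGINT,
      riskfactor FLOAT
    )
    STORED AS ORC;

Because HCatStorer() writes through HCatalog into an existing Hive table, the table has to be created up front, even though it stays empty until the Pig job runs.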
Step 3.1: Create a Pig Script
In this phase of the tutorial, we create and run a Pig script. We will use the Ambari Pig View. Let's get started.
3.1.1 Log in to the Ambari Pig User View
To get to the Ambari Pig View, click on the Ambari Views icon at the top right and select Pig. This will bring up the Ambari Pig User View interface. Your Pig View does not have any scripts to display yet, so it will look like the following: on the left is a list of your scripts, and on the right is a composition box for writing scripts. A special interface feature is the Pig helper located below the name of your script file. The Pig helper provides us with templates for statements, functions, I/O statements, HCatLoader() and Python user-defined functions. At the very bottom are status areas that will show the results of our script and the log files. The following screenshot shows and describes the various components and features of the Pig View.
3.1.2 Create a New Script
Let's enter a Pig script. Click the New Script button in the upper-right corner of the view. Name the script riskfactor.pig, then click the Create button.
3.1.3 Load Data in Pig using HCatalog
We will use HCatalog to load data into Pig. HCatalog allows us to share schemas across tools and users within our Hadoop environment. It also allows us to factor schema and location information out of our queries and scripts and centralize them in a common repository. Since the table is in HCatalog, we can use the HCatLoader() function. Pig allows us to give the table a name, or alias, and not have to worry about allocating space and defining the structure; we just have to worry about how we are processing the table. We can use the Pig helper located below the name of your script file to give us a template for the line. Click on Pig helper > HCatalog > load template. The entry TABLE is highlighted in red for us. Type the name of the table, which is geolocation. Remember to add the "a =" before the template; this saves the results into a. Note that the '=' has to have a space before and after it. Our completed line of code will look like the sketch at the end of this step. The script above loads data, in our case from a table named geolocation, using the HCatLoader() function. Copy and paste the Pig code into the riskfactor.pig window.
3.1.4 Filter Your Data Set
The next step is to select a subset of the records, so that we have the records of drivers for which the event is not normal. To do this in Pig we use the Filter operator. We instruct Pig to filter our table, keep all records where event != 'normal' and store this in b. With this one simple statement, Pig will look at each record in the table and filter out all the ones that do not meet our criteria. We can use the Pig helper again by clicking on Pig helper > Relational Operators > FILTER template. We can replace VAR with "a" (hint: tab jumps you to the next field). Our COND is "event != 'normal'" (note: single quotes are needed around normal, and don't forget the trailing semicolon). Copy and paste the Pig code into the riskfactor.pig window.
3.1.5 Iterate Your Data Set
Since we have the right set of records, let's iterate through them. We use the foreach operator on the grouped data to iterate through all the records. We would also like to know the number of non-normal events associated with a driver, so to achieve this we add '1' to every row in the data set. Pig helper > Relational Operators > FOREACH template will get us the code. Our DATA is b and the second NEW_DATA is "driverid, event, (int) '1' as occurance". Copy and paste the Pig code into the riskfactor.pig window.
3.1.6 Calculate the Total Non-Normal Events for Each Driver
The group statement is important because it groups the records by one or more relations. In our case, we want to group by driver id and iterate over each row again to sum the non-normal events. Pig helper > Relational Operators > GROUP VAR BY VAR template will get us the code. The first VAR takes "c" and the second VAR takes "driverid". Copy and paste the Pig code into the riskfactor.pig window. Next, use the foreach statement again to add up the occurance values.
3.1.7 Load the drivermileage Table and Perform a Join Operation
In this section, we will load the drivermileage table into Pig using HCatalog and perform a join operation on driverid. The resulting data set will give us the total miles and total non-normal events for each driver. Load drivermileage using HCatLoader(). Pig helper > Relational Operators > JOIN VAR BY template will get us the code. Replace VAR with 'e' and after BY put 'driverid, g by driverid'. Copy and paste the two lines of Pig code into the riskfactor.pig window.
3.1.8 Compute the Driver Risk Factor
In this section, we will associate a driver risk factor with every driver. To calculate the driver risk factor, divide the total miles travelled by the number of non-normal event occurrences. We will use the foreach statement again to compute the driver risk factor for each driver. Use the corresponding code and paste it into your Pig script. As a final step, store the data into a table using HCatalog. A hedged reconstruction of the final code, as it would look once pasted into the editor, appears below. Note that geolocation has its data stored in ORC format. Save the file riskfactor.pig by clicking the Save button in the left-hand column.
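The individual statements built up in the steps above appear only as screenshots in the original tutorial. The following is a hedged reconstruction of what the complete riskfactor.pig script could look like, using the relation names (a, b, c, d, e, g) referenced in the walkthrough; treat it as a sketch rather than the exact script.

    -- load the source table through HCatalog
    a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();

    -- keep only the rows whose event is not 'normal'
    b = FILTER a BY event != 'normal';

    -- tag every abnormal event with a count of 1
    c = FOREACH b GENERATE driverid, event, (int) '1' AS occurance;

    -- group by driver and total the abnormal events
    d = GROUP c BY driverid;
    e = FOREACH d GENERATE group AS driverid, SUM(c.occurance) AS t_occ;

    -- bring in the total miles per driver computed in Hive and join on driverid
    g = LOAD 'drivermileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
    h = JOIN e BY driverid, g BY driverid;

    -- risk factor = total miles driven / number of abnormal events
    final_data = FOREACH h GENERATE $0 AS driverid, $1 AS events, $3 AS totmiles, (float) $3 / $1 AS riskfactor;

    -- write the result into the Hive riskfactor table created earlier
    STORE final_data INTO 'riskfactor' USING org.apache.hive.hcatalog.pig.HCatStorer();

Remember that the -useHCatalog argument described in the next step must be supplied before executing, otherwise the HCatLoader and HCatStorer classes will not be on the script's classpath.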
Step 3.2: Quick Recap
Before we execute the code, let's review it again: line a loads the geolocation table from HCatalog. Line b filters the rows, keeping only those where the event is not 'normal'. Then we add a column called occurance and assign it a value of 1. We then group the records by driverid and sum up the occurrences for each driver. At this point we need the miles driven by each driver, so we load the table we created using Hive. To get our final result, we join the count of events in e with the mileage data in g by driverid. Now it is really simple to calculate the risk factor by dividing the miles driven by the number of events. You need to configure the Pig editor to use HCatalog so that the Pig script can load the proper libraries. In the Pig arguments text box, enter -useHCatalog and click the Add button. Note that this argument is case sensitive; it should be typed exactly "-useHCatalog".

Step 3.3: Execute the Pig Script on Tez
3.3.1 Execute the Pig Script
Click the Execute on Tez checkbox and finally hit the blue Execute button to submit the job. The Pig job will be submitted to the cluster. This will generate a new tab with the status of the running Pig job, and at the top you will find a progress bar that shows the job status.
3.3.2 View the Results Section
Wait for the job to complete. The output of the job is displayed in the Results section. Notice that your script does not output any result; it stores the result into a Hive table, so your Results section will be empty. Click on the Logs dropdown menu to see what happened when your script ran. Errors will appear here.
3.3.3 View the Logs Section (Debugging Practice)
Why are logs important? The logs section is helpful when debugging code after the expected output does not happen. For instance, say that in the next section we load the sample data from our riskfactor table and nothing appears. The logs will tell us why the job failed. A common issue is that Pig does not successfully read data from the geolocation table or the drivermileage table; with the logs, we can effectively address the issue. Let's verify that Pig read from these tables successfully and stored the data into our riskfactor table. You should receive similar output. What do our logs show us about our Pig script? Read 8000 records from our geolocation table; read 100 records from our drivermileage table; stored 99 records into our riskfactor table. Can you think of scenarios in which these results, if different, would help us debug our script? For example, if 0 records were read from the geolocation table, how would you solve the problem?
3.3.4 Verify the Pig Script Successfully Populated the Hive Table
Go back to the Ambari Hive User View and browse the data in the riskfactor table to verify that your Pig job successfully populated this table. Here is what it should look like. At this point we now have our truck miles-per-gallon table and our risk factor table. The next step is to pull this data into Excel to create the charts for the visualization step.

Congratulations! Let's summarize the Pig commands we learned in this tutorial to compute the risk factor analysis on the geolocation and truck data. We learned to use Pig to access the data from Hive using the LOAD ... HCatLoader() statement, and we were able to use the filter, foreach, group, join and store HCatStorer() operators to manipulate, transform and process this data. To review these Pig Latin operators, view the Pig Latin Basics, which contains documentation on each operator.
Congratulations! Let's summarize the Pig commands we learned in this tutorial to compute a risk factor analysis on the geolocation and truck data. We learned to use Pig to access the data from Hive using LOAD with HCatLoader(), and we were then able to use the filter, foreach, group, and join operators, plus store with HCatStorer(), to manipulate, transform, and process this data. To review these Pig Latin operators, see the Pig Latin Basics documentation, which covers each operator.

Suggested Readings
Strengthen your foundation in Pig Latin and reinforce why this scripting platform is beneficial for processing and analyzing massive data sets with these resources:

Tutorial Q&A and Reporting Issues
If you need help or have questions about this tutorial, please first check HCC for existing answers using the Find Answers button. If you don't find your answer, you can post a new HCC question for this tutorial using the Ask Questions button. Tutorial name: Hello HDP: An Introduction to Hadoop with Hive and Pig. If the tutorial has multiple labs, please indicate which lab your question corresponds to, and provide any feedback related to that lab. All Hortonworks, partner, and community tutorials are posted in the Hortonworks GitHub and can be contributed to via the Hortonworks Tutorial Contribution Guide. If you are certain there is an issue or bug with the tutorial, please create an issue on the repository and we will do our best to resolve it.

Lab 4: Using Apache Spark to compute Driver Risk Factor
Note: this lab is optional and produces the same result as Lab 3. You may continue on to the next lab if you wish.

Introduction
In this tutorial we will introduce Apache Spark. In the earlier sections of the lab you learned how to load data into HDFS and then manipulate it using Hive. We are using the truck sensor data to better understand the risk associated with every driver. This section will teach you how to compute that risk using Apache Spark.

Pre-Requisites
This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before proceeding: Hortonworks Sandbox; Learning the Ropes of the Hortonworks Sandbox; Lab 1: Loading sensor data into HDFS; Lab 2: Data Manipulation with Apache Hive. Allow yourself around one hour to complete this tutorial.

Background
MapReduce has been useful, but the amount of time it takes for jobs to run can at times be exhausting. Also, MapReduce jobs only work well for a specific set of use cases; there is a need for a computing framework that works for a wider set of them. Apache Spark was designed to be a fast, general-purpose, easy-to-use computing platform. It extends the MapReduce model and takes it to a whole other level. The speed comes from in-memory computation: applications running in memory allow for much faster processing and response.

Apache Spark
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow data workers to efficiently execute machine learning algorithms that require fast iterative access to datasets. Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN-enabled workloads in the enterprise. You can run batch applications such as MapReduce-style jobs or iterative algorithms that build upon each other. You can also run interactive queries and process streaming data with your application. Spark also provides a number of libraries that let you easily expand beyond its basic capabilities, covering machine learning, SQL, streaming, and graph processing. Spark runs on Hadoop clusters such as Hadoop YARN or Apache Mesos, or even in standalone mode with its own scheduler. The Sandbox includes both Spark 1.6 and Spark 2.0. Let's get started.

Step 4.1: Configure Spark services using Ambari
1. Log on to the Ambari Dashboard as maria_dev.
At the bottom left corner of the services column, check that Spark and Zeppelin are running. Note: if these services are disabled, start them.

For HDP 2.5 Sandbox users: activate Livy Server. Livy Server is a new feature added to the latest Sandbox HDP platform, and it adds extra security while running our Spark jobs from a Zeppelin notebook. For this lab, users who have the HDP 2.5 Sandbox can use Livy.
2. Now verify that the Spark Livy server is running.
3. As you can see, our server is down. We need to start it before running Spark jobs in Zeppelin. Click on Livy Server, then click on sandbox.hortonworks. Scroll down to Livy Server, press the Stopped button, and start the server. Press the OK button in the confirmation window. Livy Server started.
4. Go back into the Spark service. Click on Service Actions -> Turn Off Maintenance Mode, then log out of Ambari.
5. Access Zeppelin at sandbox.hortonworks:9995 through its port number. Refer to Learning the Ropes of the Hortonworks Sandbox if you need assistance figuring out your hostname. You should see a Zeppelin welcome page. Optionally, if you want to find out how to access the Spark shell to run code on Spark, refer to Appendix A.
6. Create a Zeppelin notebook. Click on the Notebook tab at the top left and hit Create new note. Name your notebook Compute Riskfactor with Spark. By default, the notebook will load the Spark Scala API.

Step 4.2: Create a HiveContext
For improved Hive integration, HDP 2.5 offers ORC file support for Spark. This allows Spark to read data stored in ORC files. Spark can leverage ORC's more efficient columnar storage and predicate pushdown capability for even faster in-memory processing. HiveContext is an instance of the Spark SQL execution engine that integrates with data stored in Hive; it reads the Hive configuration from hive-site.xml on the classpath. The more basic SQLContext provides a subset of the Spark SQL support that does not depend on Hive.

Import the SQL libraries. If you have gone through the Pig section, you have to drop the table riskfactor so that you can populate it again using Spark. Copy and paste the following code into your Zeppelin notebook, then click the play button; alternatively, press Shift+Enter to run the code. We will see that there is a table called riskfactor; let us drop it. To verify, run show tables again. Now create it back with the same DDL that we executed in the Pig section. We can use either the original spark interpreter or the livy spark interpreter to run Spark code; the difference is that Livy comes with more security. The default interpreter for Spark jobs is spark.

Instantiate a HiveContext. sc stands for SparkContext, the main entry point to everything Spark. It can be used to create RDDs and shared variables on the cluster. When you start up the Spark shell, the SparkContext is automatically initialized for you in the variable sc.
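The actual code blocks did not survive this copy of the tutorial. A minimal Scala sketch of Step 4.2, assuming the riskfactor schema used earlier in the series (driverid, events, totmiles, riskfactor), could look like this; run it in a Zeppelin paragraph with the spark (or livy) interpreter.

import org.apache.spark.sql.hive.HiveContext

// sc (the SparkContext) is already provided by Zeppelin and the Spark shell
val hiveContext = new HiveContext(sc)

// Drop the riskfactor table populated in the Pig lab, then confirm it is gone
hiveContext.sql("DROP TABLE IF EXISTS riskfactor")
hiveContext.sql("SHOW TABLES").collect().foreach(println)

// Recreate the empty table with the DDL used earlier (schema assumed here)
hiveContext.sql("CREATE TABLE riskfactor (driverid STRING, events BIGINT, totmiles BIGINT, riskfactor FLOAT) STORED AS ORC")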
Step 4.3: Create an RDD from HiveContext
Spark's primary core abstraction is called a Resilient Distributed Dataset, or RDD. It is a distributed collection of elements that is parallelized across the cluster. In other words, an RDD is an immutable collection of objects that is partitioned and distributed across multiple physical nodes of a YARN cluster and that can be operated on in parallel. There are three methods for creating an RDD: parallelize an existing collection, which means the data already resides within Spark and can now be operated on in parallel; reference a dataset, which can come from any storage source supported by Hadoop such as HDFS, Cassandra, or HBase; or transform an existing RDD to create a new RDD. We will be using the latter two methods in this tutorial.

RDD Transformations and Actions
Typically, RDDs are instantiated by loading data from a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat on a YARN cluster. Once an RDD is instantiated, you can apply a series of operations. All operations fall into one of two types: transformations or actions. Transformation operations, as the name suggests, create new datasets from an existing RDD and build out the processing DAG that can then be applied on the partitioned dataset across the YARN cluster. Transformations do not return a value; in fact, nothing is evaluated during the definition of these transformation statements. Spark just creates these Directed Acyclic Graphs (DAGs), which are only evaluated at runtime. We call this lazy evaluation. An action operation, on the other hand, executes a DAG and returns a value.

4.3.1 View List of Tables in Hive Warehouse
Use a simple show command to see the list of tables in the Hive warehouse. Note: false indicates whether the column requires data. You will notice that the geolocation table and the drivermileage table that we created earlier in the tutorial are already listed in the Hive metastore and can be queried directly.

4.3.2 Query Tables To Build Spark RDD
We will do a simple select query to fetch data from the geolocation and drivermileage tables into Spark variables. Getting data into Spark this way also copies the table schema into the RDD.

4.4 Querying Against a Table
4.4.1 Registering a Temporary Table
Now let's register a temporary table and use SQL syntax to query against it. Next, we will perform an iteration and a filter operation. First, we need to filter for drivers that have non-normal events associated with them, and then count the number of non-normal events for each driver. As stated earlier about RDD transformations, the select operation is an RDD transformation and therefore does not return anything. The resulting table will have a count of the total non-normal events associated with each driver. Register this filtered table as a temporary table so that subsequent SQL queries can be applied to it. You can view the result by executing an action operation on the RDD.

4.4.2 Perform a Join Operation
In this section we will perform a join operation. The geolocation_temp2 table has details of drivers and the count of their respective non-normal events; the drivermileage_temp1 table has details of the total miles travelled by each driver. We will join the two tables on the common column, which in our case is driverid. The resulting data set will give us the total miles and total non-normal events for a particular driver. Register this table as a temporary table so that subsequent SQL queries can be applied to it. You can view the result by executing an action operation on the RDD.

4.4.3 Compute Driver Risk Factor
In this section we will associate a driver risk factor with every driver. The driver risk factor is calculated by dividing the total miles travelled by the non-normal event occurrences. The resulting data set will give us the total miles, the total non-normal events, and the risk for a particular driver. Register this table as a temporary table so that subsequent SQL queries can be applied to it.
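Pulling Steps 4.3 and 4.4 together, a sketch of the queries, the temporary tables, the join, and the risk factor computation could look like the following. The variable and temporary table names (geolocation_temp1, geolocation_temp2, drivermileage_temp1, joined, risk_factor_spark) follow the surrounding text and are not guaranteed to match the original tutorial exactly.

// 4.3.2: pull the Hive tables into Spark; the schema comes along with the data
val geolocation_temp1 = hiveContext.sql("SELECT * FROM geolocation")
val drivermileage_temp1 = hiveContext.sql("SELECT * FROM drivermileage")
geolocation_temp1.registerTempTable("geolocation_temp1")
drivermileage_temp1.registerTempTable("drivermileage_temp1")

// 4.4.1: count non-normal events per driver (a transformation; nothing runs yet)
val geolocation_temp2 = hiveContext.sql(
  "SELECT driverid, COUNT(driverid) AS occurance " +
  "FROM geolocation_temp1 WHERE event != 'normal' GROUP BY driverid")
geolocation_temp2.registerTempTable("geolocation_temp2")
geolocation_temp2.show(10)   // an action: materialize a few rows to inspect

// 4.4.2: join event counts with total miles on driverid
val joined = hiveContext.sql(
  "SELECT a.driverid, a.occurance, b.totmiles " +
  "FROM geolocation_temp2 a JOIN drivermileage_temp1 b ON a.driverid = b.driverid")
joined.registerTempTable("joined")
joined.show(10)

// 4.4.3: risk factor = total miles travelled / non-normal event occurrences
val risk_factor_spark = hiveContext.sql(
  "SELECT driverid, occurance, totmiles, totmiles / occurance AS riskfactor FROM joined")
risk_factor_spark.registerTempTable("risk_factor_spark")
risk_factor_spark.show(10)

Note that show(10) is the action that forces each step to run; everything before it only defines the lazily evaluated DAG described above.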
Step 4.5: Load and Save Data into Hive as ORC
In this section we store data in the ORC (Optimized Row Columnar) format using Spark. ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for each type and builds an internal index as the file is persisted. Predicate pushdown uses those indexes to determine which stripes in a file need to be read for a particular query, and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions.

4.5.1 Create an ORC table
Create a table and store it as ORC. Specifying stored as orc at the end of the SQL statement ensures that the Hive table is stored in the ORC format. Note: toDF() creates a DataFrame with columns driverid string, occurance bigint, and so on.

4.5.2 Convert data into an ORC table
Before we load the data into the Hive table that we created above, we have to convert our data into ORC format as well. Note: for Spark 1.4.1 and higher, use the DataFrame writer shown in the sketch below; if you used it, skip the following instruction and move on to 4.5.3. For Spark 1.3.1, a different save call is needed.

4.5.3 Load the data into the Hive table using the load data command.

4.5.4 Create the final table riskfactor using CTAS.

4.5.5 Verify Data Successfully Populated Hive Table in Hive (Check 2)
Execute a select query to verify that your table has been successfully stored. You can go to the Ambari Hive user view to check whether the Hive table you created has been populated with data. Hive riskfactor table populated. Did both tables have the same data for the first 10 rows?
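Here is a sketch of Steps 4.5.1 through 4.5.5 with the missing code reconstructed from the description; the finalresults schema and the risk_factor_spark output path are assumptions.

// 4.5.1: create an ORC-backed Hive table to hold the final results
hiveContext.sql("CREATE TABLE finalresults (driverid STRING, occurance BIGINT, totmiles BIGINT, riskfactor DOUBLE) STORED AS ORC")

// 4.5.2: write the computed result out in ORC format (Spark 1.4.1 and higher);
// Spark 1.3.1 used an older ORC save call instead of the DataFrame writer API
risk_factor_spark.write.format("orc").save("risk_factor_spark")

// 4.5.3: load the ORC files into the Hive table
hiveContext.sql("LOAD DATA INPATH 'risk_factor_spark' INTO TABLE finalresults")

// 4.5.4: create the final riskfactor table using CTAS
// (drop the empty table recreated in Step 4.2 first so CTAS can create it)
hiveContext.sql("DROP TABLE IF EXISTS riskfactor")
hiveContext.sql("CREATE TABLE riskfactor STORED AS ORC AS SELECT * FROM finalresults")

// 4.5.5: verify that the table was populated
hiveContext.sql("SELECT * FROM riskfactor LIMIT 10").show()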
Full Spark Code Review for the Lab: import the Hive and SQL libraries; show the tables in the default Hive database; select all rows and columns from the tables, store the Hive queries in variables, and register them as temporary tables; load the first 10 rows from geolocation_temp2 and from the drivermileage data; create joined to join the two tables on the same driverid and register joined as a temporary table; load the first 10 rows and columns of joined; initialize risk_factor_spark and register it as a temporary table; print the first 10 lines from the risk_factor_spark table; and finally create the table finalresults in Hive, save it as ORC, load data into it, and then create the final table called riskfactor using CTAS.

Appendix A: Run Spark Code in the Spark Interactive Shell
1) Open your terminal or PuTTY and SSH into the Sandbox using root as the login and hadoop as the password. Optionally, if you don't have an SSH client installed and configured, you can use the built-in web client, which can be accessed from host:4200 (use the same username and password provided above).
2) Let's enter the Spark interactive shell (the Spark REPL) by typing the command spark-shell. This will load the default Spark Scala API. Note: Hive comes preconfigured with the HDP Sandbox. The coding exercise we just went through can also be completed in the Spark shell; just as we did in Zeppelin, you can copy and paste the code.

Congratulations! Let's summarize the Spark coding skills and knowledge we acquired to compute the risk factor associated with every driver. Apache Spark is efficient for computation because of its in-memory data processing engine. We learned how to integrate Hive with Spark by creating a HiveContext. We used our existing data from Hive to create an RDD. We learned to perform RDD transformations and actions to create new datasets from existing RDDs; these new datasets include filtered, manipulated, and processed data. After we computed the risk factor, we learned to load and save data into Hive as ORC.

Suggested Readings

Tutorial Q&A and Reporting Issues
If you need help or have questions about this tutorial, please first check HCC for existing answers using the Find Answers button. If you don't find your answer, you can post a new HCC question for this tutorial using the Ask Questions button. Tutorial name: Hello HDP: An Introduction to Hadoop with Hive and Pig. If the tutorial has multiple labs, please indicate which lab your question corresponds to, and provide any feedback related to that lab. All Hortonworks, partner, and community tutorials are posted in the Hortonworks GitHub and can be contributed to via the Hortonworks Tutorial Contribution Guide. If you are certain there is an issue or bug with the tutorial, please create an issue on the repository and we will do our best to resolve it.

Lab 5 - Data Reporting With Zeppelin
Introduction
In this tutorial you will be introduced to Apache Zeppelin. In an earlier section of the lab, you learned how to perform data visualization using Excel. This section will teach you how to visualize data using Zeppelin.

Prerequisites
This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before proceeding: Hortonworks Sandbox; Learning the Ropes of the Hortonworks Sandbox; Lab 1: Load sensor data into HDFS; Lab 2: Data Manipulation with Apache Hive; Lab 3: Use Pig to compute Driver Risk Factor; Lab 4: Use Spark to compute Driver Risk Factor; a working Zeppelin installation. Allow yourself approximately one hour to complete this tutorial.

Apache Zeppelin
Apache Zeppelin provides a powerful web-based notebook platform for data analysis and discovery. Behind the scenes it supports Spark distributed contexts as well as other language bindings on top of Spark. In this tutorial we will use Apache Zeppelin to run SQL queries on the geolocation, trucks, and riskfactor data that we collected earlier and visualize the results through graphs and charts. NOTE: we can also run queries via various other interpreters, including (but not limited to) spark, hawq, and postgresql.

Step 5.1: Create a Zeppelin Notebook
5.1.1 Navigate to Zeppelin Notebook
1) Navigate to sandbox.hortonworks:9995 directly to open the Zeppelin interface.
2) Click on create note, name the notebook Driver Risk Factor, and a new notebook will open.

Step 5.2: Execute a Hive Query
5.2.1 Visualize finalresults Data in Tabular Format
In the previous Spark and Pig tutorials you already created a table, finalresults or riskfactor, which gives the risk factor associated with every driver. We will use the data we generated in this table to visualize which drivers have the highest risk factor. We will use the jdbc hive interpreter to write queries in Zeppelin; jdbc runs hive by default.
1) Copy and paste the query into your Zeppelin note (a sketch of such a query appears at the end of this step).
2) Click the play button next to "ready" or "finished" to run the query in the Zeppelin notebook. An alternative way to run the query is the keyboard shortcut Shift+Enter. Initially, the query will produce the data in tabular format as shown in the screenshot.
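The query itself did not survive this copy; a minimal example for step 5.2.1, assuming the riskfactor table from the Pig or Spark lab (use finalresults instead if that is the table you populated), would be a paragraph such as:

%jdbc(hive)
SELECT * FROM riskfactor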
Step 5.3: Build Charts Using Zeppelin
5.3.1 Visualize finalresults Data in Chart Format
1. Iterate through each of the tabs that appear underneath the query. Each one displays a different type of chart, depending on the data that is returned by the query.
2. After clicking on a chart, we can open extra advanced settings to tailor the view of the data we want.
3. Click settings to open the advanced chart features.
4. To make a chart with riskfactor.driverid and riskfactor.riskfactor SUM, drag the table relations into the boxes as shown in the image below.
5. You should now see an image like the one below.
6. If you hover over the peaks, each will give the driverid and riskfactor.
7. Try experimenting with the different types of charts, as well as dragging and dropping the different table fields, to see what kind of results you can obtain.
8. Let's try a different query to find which cities and states contain the drivers with the highest risk factors (a sketch of such a query appears at the end of this step).
9. Run the query using the keyboard shortcut Shift+Enter. You should end up with the results in a table.
10. After changing a few of the settings, we can figure out which of the cities have the highest risk factors. Try changing the chart settings by clicking the scatterplot icon. Then make sure that the key a.driverid is in the xAxis field, a.riskfactor is in the yAxis field, and b.city is in the group field. The chart should look similar to the following. The graph shows that driver id A39 has a high risk factor of 652417 and drives in Santa Maria.
Now that we know how to use Apache Zeppelin to obtain and visualize our data, we can take the skills we have learned from our Hive, Pig, and Spark labs and apply them to new kinds of data to try to make better sense and meaning of the numbers.
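A sketch of the query from step 8, joining the riskfactor table (aliased a) with the geolocation table (aliased b) so that city and state are available for grouping, might look like:

%jdbc(hive)
SELECT a.driverid, a.riskfactor, b.city, b.state
FROM riskfactor a
JOIN geolocation b ON a.driverid = b.driverid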
Suggested Readings

Tutorial Q&A and Reporting Issues
If you need help or have questions about this tutorial, please first check HCC for existing answers using the Find Answers button. If you don't find your answer, you can post a new HCC question for this tutorial using the Ask Questions button. Tutorial name: Hello HDP: An Introduction to Hadoop with Hive and Pig. If the tutorial has multiple labs, please indicate which lab your question corresponds to, and provide any feedback related to that lab. All Hortonworks, partner, and community tutorials are posted in the Hortonworks GitHub and can be contributed to via the Hortonworks Tutorial Contribution Guide. If you are certain there is an issue or bug with the tutorial, please create an issue on the repository and we will do our best to resolve it.

Lab 6: Data Reporting with Excel
Introduction
This step is optional, as it requires you to have Excel and Power View; feel free to connect from any reporting tool to do a similar exercise. In this section, we will use Microsoft Excel Professional Plus 2013 to access the refined data over an ODBC connection.

Prerequisites
This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before proceeding: Lab 0 (Hortonworks Sandbox set up); Lab 1: Loading sensor data into HDFS; Lab 2: Data Manipulation with Apache Hive; Lab 3: Use Pig to compute Driver Risk Factor; Lab 4: Use Spark to compute Driver Risk Factor. Please configure the ODBC drivers on your system with the help of the following tutorial: Installing and Configuring the Hortonworks ODBC Driver on Windows 7. Allow yourself around half an hour to complete this tutorial.

Step 6.b.1: Access the Refined Data with Microsoft Excel
The Hive ODBC driver can be found at the Hortonworks Add-on page; for the Windows ODBC driver setup, follow these instructions. Open the ODBC connection manager and open the connection you set up. It should look like this.
1) Open a new blank workbook. Select Data > From Other Sources > From Microsoft Query.
2) On the Choose Data Source pop-up, select the Hortonworks ODBC data source you installed previously, then click OK. The Hortonworks ODBC driver enables you to access Hortonworks data with Excel and other Business Intelligence (BI) applications that support ODBC. We will import the avgmileage table.
3) Accept the defaults for everything and click through until you hit the Finish button. After you click Finish, Excel sends the data request over to Hadoop; it will take a while for this to happen. When the data is returned, Excel will ask where to place it in the workbook. We want it in cell A1, like this.
4) Once the data is placed, you will see the avgmileage table imported into your spreadsheet.

Step 6.b.2: Visualize Data with Microsoft Excel
1) Now we are going to insert a Power View report. Follow this link to set up the Power View report if you do not have it. This will create a new tab in your workbook with the data inserted in the Power View page.
2) Select the design tab at the top, then select a column chart and use the stacked column version in the drop-down menu. This will give you a bar chart. Grab the lower right of the chart and stretch it out to the full pane. Close the filter tab and the chart will expand and look like this.
3) To finish off the tutorial, I am going to create a map of the events reported in the geolocation table. I will show you how you can build up the queries and create a map of the data on an ad hoc basis.
4) For a map we need location information and a data point. Looking at the geolocation table, I will simply plot the location of each of the events. I will need the driverid, city, and state columns from this table. We know that a select statement will let me extract these columns, so to start off I can just create the select query in the Query Editor.
5) Query a subset of the geolocation columns.
6) After I execute the query, I see what results are returned. In a more complex query you can easily make changes to the query at this point until you get the right results. The results I get back look like this.
7) Since my results look fine, I now need to capture the result in a table. So I will use the select statement as part of my CTAS (create table as select) pattern, and I will call the table events. Create the table events from the existing geolocation data (a sketch of both statements appears at the end of this step).
8) I can execute the query and the table events gets created. As we saw earlier, I can go to Excel and import the table into a blank worksheet. The imported data will look like this.
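The two statements described in steps 5 through 7 might look like the following; the column names follow the text, and the statements are run from the Hive Query Editor.

-- Step 5: query a subset of the geolocation columns
SELECT driverid, city, state FROM geolocation;

-- Step 7: capture the result in a table named events using the CTAS pattern
CREATE TABLE events AS SELECT driverid, city, state FROM geolocation;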
9) Now I can insert the Power View tab in the Excel workbook. To get a map, I just select the Design tab at the top and select the Map button in the menu bar.
10) Make sure you have a network connection, because Power View uses Bing to do the geocoding, which translates the city and state columns into map coordinates. If we just want to see where events took place, we can uncheck the driverid. The finished map looks like this.

We've shown how the Hortonworks Data Platform (HDP) can store and analyze geolocation data. In addition, I have shown you a few techniques for building your own queries. You can easily plot risk factor and miles per gallon as bar charts, and I showed you the basics of creating maps. A good next step is to plot only certain types of events; using the pattern I gave you, it is pretty straightforward to extract the data and visualize it in Excel.

Next Steps: Try These
Congratulations on finishing a comprehensive series on Hadoop and HDP. By now you should have a good understanding of the fundamentals of Hadoop and its related ecosystem, including MapReduce, YARN, HDFS, Hive, Pig, and Spark. As a Hadoop practitioner, you can choose among three basic personas to build upon your skills:

Case Studies: Learn more about Hadoop through these case studies:

Suggested Readings

Tutorial Q&A and Reporting Issues
If you need help or have questions about this tutorial, please first check HCC for existing answers using the Find Answers button. If you don't find your answer, you can post a new HCC question for this tutorial using the Ask Questions button. Tutorial name: Hello HDP: An Introduction to Hadoop with Hive and Pig. If the tutorial has multiple labs, please indicate which lab your question corresponds to, and provide any feedback related to that lab. All Hortonworks, partner, and community tutorials are posted in the Hortonworks GitHub and can be contributed to via the Hortonworks Tutorial Contribution Guide. If you are certain there is an issue or bug with the tutorial, please create an issue on the repository and we will do our best to resolve it.
