Evaluating bids

MapReduce Big Data

Published on August 09, 2019 in IT & Programming

About this project

Open

Your project must incorporate the following elements:
1. Consider a large dataset whose size justifies the complexity level (structured, semi-structured, or unstructured). Source datasets can be static (file or database) or programmatically retrieved from an API/web service/scrape, or a mixture of the two.
2. Utilisation of a distributed data processing environment (e.g., Hadoop MapReduce or Spark) for some part of the analysis.
3. Source dataset(s) should be stored in an appropriate SQL/NoSQL database(s) prior to processing by MapReduce/Spark (HBase/Hive/Spark SQL/Kudu/Cassandra/MongoDB/etc.). The data should be populated into the database using an appropriate tool (Sqoop/Spark/Pentaho/Talend/etc.); a minimal loading sketch appears after this list.
4. The post-MapReduce/Spark dataset(s) should be stored in an appropriate NoSQL database(s) (follow a similar choice as in step 3, or a different one).
5. Programmatically accessing the source data from the chosen NoSQL database using appropriate MapReduce/Spark code (i.e., you should not extract the data to text files before running the MapReduce/Spark task; the task should read directly from the database). A read/write sketch follows this list.
6. Programmatically storing the MapReduce/Spark output data into the chosen NoSQL output database (again, the MapReduce/Spark task should write directly to the database).
7. Follow-up analysis on the output data. It can be extracted from the NoSQL database into another format using an appropriate tool, if necessary (e.g., extract to CSV to import into R/Python/MATLAB/Qlik/Power BI/Tableau/SPSS).
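
For step 3, a minimal loading sketch in PySpark, assuming MongoDB as the store and the MongoDB Spark Connector 10.x on the classpath; the connection URI, file name, and the bigdata/source database and collection names are illustrative assumptions, not part of the brief.

```python
from pyspark.sql import SparkSession

# Assumes the MongoDB Spark Connector 10.x is supplied at submit time, e.g.:
#   spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:10.2.1 load_source.py
spark = (
    SparkSession.builder
    .appName("LoadSourceData")
    .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
    .getOrCreate()
)

# Hypothetical static source file; an API or scrape dump would work the same way.
df = spark.read.csv("source_data.csv", header=True, inferSchema=True)

# Populate the source collection prior to any MapReduce/Spark processing.
(df.write.format("mongodb")
   .option("database", "bigdata")
   .option("collection", "source")
   .mode("overwrite")
   .save())
```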
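
For steps 5 and 6, a sketch of a Spark job that reads the source data directly from the database (no intermediate text files) and writes its output straight back to another collection; the category and value columns are stand-ins for whatever the real dataset contains.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("ProcessSourceData")
    .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017")
    .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
    .getOrCreate()
)

# Read directly from the NoSQL store -- no extraction to text files first.
source = (spark.read.format("mongodb")
          .option("database", "bigdata")
          .option("collection", "source")
          .load())

# Placeholder aggregation standing in for the real analysis:
# record count and mean value per category.
result = (source.groupBy("category")
          .agg(F.count("*").alias("n"),
               F.avg("value").alias("avg_value")))

# Write the processed output directly back into a NoSQL collection.
(result.write.format("mongodb")
   .option("database", "bigdata")
   .option("collection", "results")
   .mode("overwrite")
   .save())
```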


For example, you may initially utilise MySQL to store a large structured dataset (or any NoSQL database for a semi-structured or unstructured dataset), and your Hadoop MapReduce/Spark processing would then utilise the MySQL (or NoSQL) database as an input source. After processing the data through MapReduce/Spark, you may then store the output into HBase, Hive, or MongoDB.
Following that, you may use Python/NumPy/pandas/Matplotlib/MATLAB or R/ggplot2/plotly to conduct further analysis of the MapReduce output data (e.g., statistical analysis) and generate data visualisation plots for better presentation of the results (a minimal sketch follows below). Alternatively, you can import the generated output data into Microsoft Power BI or IBM SPSS to generate the analyses.
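
As one possible follow-up step, a short pandas/Matplotlib sketch, assuming the output collection was exported to a results.csv file (e.g. with mongoexport) carrying the hypothetical category/avg_value columns from the sketch above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV export of the output collection, e.g.:
#   mongoexport --db bigdata --collection results --type csv \
#       --fields category,n,avg_value --out results.csv
df = pd.read_csv("results.csv")

# Basic statistical summary of the MapReduce/Spark output.
print(df.describe())

# Visualise the aggregated values per category.
df.plot.bar(x="category", y="avg_value", legend=False)
plt.ylabel("avg_value")
plt.tight_layout()
plt.savefig("avg_value_by_category.png")
```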

Category: IT & Programming
Subcategory: Other
Project size: Small
Is this a project or a position? Project
I currently have: I have specifications
Required availability: As needed

Delivery term: Not specified

Skills needed