Data and analytics
Python is increasingly popular for data science and analytics, partly because of its excellent supporting libraries: NumPy, SciPy, pandas, Statsmodels (ref), Scikit-Learn, and Matplotlib, to name the most common ones. One obstacle to adoption can be a lack of documentation: the Statsmodels documentation, for example, is sparse and assumes a fair level of statistical knowledge. This article shows how one feature of Statsmodels, namely Generalized Linear Models (GLM), can be used to build useful models for understanding count data.
When Apache Spark 1.0 was released in mid-2014, it was quickly recognised for its compelling rethink of the conventions of large-scale data processing. Stepping away from the then-prevalent MapReduce mindset, Spark introduced the concept of the Resilient Distributed Dataset (RDD): an immutable multiset of data distributed over a cluster.