For this notebook, we're using data downloaded from data.cms.gov. I didn't commit a copy of the data to this repo as it's fairly large. To run this notebook, you need to first need to export the data below as CSV (non-Excel), then copy it into the data/dl directory of this project:

Use the PYSPARK_SUBMIT_ARGS environment variable to load additional Spark packages we'll need

Specifically:

In [1]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.11:1.5.0 pyspark-shell'

Set up a SparkContext and SQLContext to work with, and load the health provider data into it.

In [2]:
import pyspark
sc = pyspark.SparkContext('local[*]')
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
In [14]:
csvPath = "data/dl/ipp_sps.csv"
In [15]:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(csvPath)
In [16]:
df.show()
+--------------------+-----------+--------------------+-----------------------+-------------+--------------+-----------------+------------------------------------+------------------+-------------------------+------------------------+-------------------------+
|      DRG Definition|Provider Id|       Provider Name|Provider Street Address|Provider City|Provider State|Provider Zip Code|Hospital Referral Region Description| Total Discharges | Average Covered Charges | Average Total Payments |Average Medicare Payments|
+--------------------+-----------+--------------------+-----------------------+-------------+--------------+-----------------+------------------------------------+------------------+-------------------------+------------------------+-------------------------+
|039 - EXTRACRANIA...|      10001|SOUTHEAST ALABAMA...|   1108 ROSS CLARK C...|       DOTHAN|            AL|            36301|                         AL - Dothan|                91|                $32963.07|                $5777.24|                 $4763.73|
|039 - EXTRACRANIA...|      10005|MARSHALL MEDICAL ...|   2505 U S HIGHWAY ...|         BOAZ|            AL|            35957|                     AL - Birmingham|                14|                $15131.85|                $5787.57|                 $4976.71|
|039 - EXTRACRANIA...|      10006|ELIZA COFFEE MEMO...|     205 MARENGO STREET|     FLORENCE|            AL|            35631|                     AL - Birmingham|                24|                $37560.37|                $5434.95|                 $4453.79|
|039 - EXTRACRANIA...|      10011|   ST VINCENT'S EAST|   50 MEDICAL PARK E...|   BIRMINGHAM|            AL|            35235|                     AL - Birmingham|                25|                $13998.28|                $5417.56|                 $4129.16|
|039 - EXTRACRANIA...|      10016|SHELBY BAPTIST ME...|   1000 FIRST STREET...|    ALABASTER|            AL|            35007|                     AL - Birmingham|                18|                $31633.27|                $5658.33|                 $4851.44|
|039 - EXTRACRANIA...|      10023|BAPTIST MEDICAL C...|   2105 EAST SOUTH B...|   MONTGOMERY|            AL|            36116|                     AL - Montgomery|                67|                $16920.79|                $6653.80|                 $5374.14|
|039 - EXTRACRANIA...|      10029|EAST ALABAMA MEDI...|   2000 PEPPERELL PA...|      OPELIKA|            AL|            36801|                     AL - Birmingham|                51|                $11977.13|                $5834.74|                 $4761.41|
|039 - EXTRACRANIA...|      10033|UNIVERSITY OF ALA...|   619 SOUTH 19TH ST...|   BIRMINGHAM|            AL|            35233|                     AL - Birmingham|                32|                $35841.09|                $8031.12|                 $5858.50|
|039 - EXTRACRANIA...|      10039| HUNTSVILLE HOSPITAL|          101 SIVLEY RD|   HUNTSVILLE|            AL|            35801|                     AL - Huntsville|               135|                $28523.39|                $6113.38|                 $5228.40|
|039 - EXTRACRANIA...|      10040|GADSDEN REGIONAL ...|   1007 GOODYEAR AVENUE|      GADSDEN|            AL|            35903|                     AL - Birmingham|                34|                $75233.38|                $5541.05|                 $4386.94|
|039 - EXTRACRANIA...|      10046|RIVERVIEW REGIONA...|   600 SOUTH THIRD S...|      GADSDEN|            AL|            35901|                     AL - Birmingham|                14|                $67327.92|                $5461.57|                 $4493.57|
|039 - EXTRACRANIA...|      10055|    FLOWERS HOSPITAL|   4370 WEST MAIN ST...|       DOTHAN|            AL|            36305|                         AL - Dothan|                45|                $39607.28|                $5356.28|                 $4408.20|
|039 - EXTRACRANIA...|      10056|ST VINCENT'S BIRM...|   810 ST VINCENT'S ...|   BIRMINGHAM|            AL|            35205|                     AL - Birmingham|                43|                $22862.23|                $5374.65|                 $4186.02|
|039 - EXTRACRANIA...|      10078|NORTHEAST ALABAMA...|   400 EAST 10TH STREET|     ANNISTON|            AL|            36207|                     AL - Birmingham|                21|                $31110.85|                $5366.23|                 $4376.23|
|039 - EXTRACRANIA...|      10083|SOUTH BALDWIN REG...|   1613 NORTH MCKENZ...|        FOLEY|            AL|            36535|                         AL - Mobile|                15|                $25411.33|                $5282.93|                 $4383.73|
|039 - EXTRACRANIA...|      10085|DECATUR GENERAL H...|     1201 7TH STREET SE|      DECATUR|            AL|            35609|                     AL - Huntsville|                27|                 $9234.51|                $5676.55|                 $4509.11|
|039 - EXTRACRANIA...|      10090| PROVIDENCE HOSPITAL|   6801 AIRPORT BOUL...|       MOBILE|            AL|            36608|                         AL - Mobile|                27|                $15895.85|                $5930.11|                 $3972.85|
|039 - EXTRACRANIA...|      10092|D C H REGIONAL ME...|   809 UNIVERSITY BO...|   TUSCALOOSA|            AL|            35401|                     AL - Tuscaloosa|                31|                $19721.16|                $6192.54|                 $5179.38|
|039 - EXTRACRANIA...|      10100|     THOMAS HOSPITAL|      750 MORPHY AVENUE|     FAIRHOPE|            AL|            36532|                         AL - Mobile|                18|                $10710.88|                $4968.00|                 $3898.88|
|039 - EXTRACRANIA...|      10103|BAPTIST MEDICAL C...|   701 PRINCETON AVE...|   BIRMINGHAM|            AL|            35211|                     AL - Birmingham|                33|                $51343.75|                $5996.00|                 $4962.45|
+--------------------+-----------+--------------------+-----------------------+-------------+--------------+-----------------+------------------------------------+------------------+-------------------------+------------------------+-------------------------+
only showing top 20 rows

In [17]:
df.printSchema()
root
 |-- DRG Definition: string (nullable = true)
 |-- Provider Id: integer (nullable = true)
 |-- Provider Name: string (nullable = true)
 |-- Provider Street Address: string (nullable = true)
 |-- Provider City: string (nullable = true)
 |-- Provider State: string (nullable = true)
 |-- Provider Zip Code: integer (nullable = true)
 |-- Hospital Referral Region Description: string (nullable = true)
 |--  Total Discharges : integer (nullable = true)
 |--  Average Covered Charges : string (nullable = true)
 |--  Average Total Payments : string (nullable = true)
 |-- Average Medicare Payments: string (nullable = true)

How many distinct health providers in the data?

In [18]:
df.select("Provider Id").distinct().count()
Out[18]:
3337

Count the number of types of reimbursements by provider name and sort descending

In [19]:
df.groupBy("Provider Id").count().sort('count', ascending=False).show()
+-----------+-----+
|Provider Id|count|
+-----------+-----+
|     370091|  100|
|     220086|  100|
|     180088|  100|
|     220071|  100|
|     160083|  100|
|     330286|  100|
|     100088|  100|
|     230047|  100|
|     230269|  100|
|     220074|  100|
|     100075|  100|
|     450058|  100|
|     110035|  100|
|     230038|  100|
|     390049|  100|
|     180040|  100|
|     360035|  100|
|     340040|  100|
|     440049|  100|
|     340032|  100|
+-----------+-----+
only showing top 20 rows

In [ ]: