Retrieving data from WRDS directly using Python, R, and Stata

Posted by Ties de Kok - Feb 08, 2018
0
3142

Downloading data from the WRDS website is convenient but not the most transparent and replicable as it requires a workflow along the lines of:

  1. Create a list of identifiers using your program (i.e. Python)
  2. Load the identifiers into the WRDS web interface + make your query
  3. Download the resulting data from WRDS into a file
  4. Load the WRDS data file back into your program

The problem is that step 2 and step 3 happen outside of your code. This essentially leaves a gap in the steps required to run your code, which can leave other people (or your future-self) confused as to what you exactly did on the WRDS website to extract the required data.

For a long time WRDS only offered to programmatically interface with WRDS data through SAS files. Non-SAS clients (think Python and R) could therefore only interact with WRDS data through SAS-bindings. This approach technically worked but was definitely a substantial hassle for many applications. Fortunately, WRDS decided to also make all their data available through a series of PostgreSQL databases. This change happened somewhere in June / July of 2017 (at least for the WRDS Python package). In other words, we can now finally directly interface with WRDS data using tools like Python, R, Stata, and MATLAB!

The WRDS support documents extensively describe how interface with WRDS using the different programs. I will include all links below, but for the sake of completeness will discuss a couple of tips and tricks + some Python code examples to illustrate how one can use it.

Tips and Tricks

  • The WRDS documentation for a particular program is usually roughly split into two sections: using the WRDS cloud and using your own system. If you have a sufficiently powerful system I recommend to start with using your own system. As such I don’t recommend to go through all of the documentation linearly, but instead to scroll down to the actual code examples.
  • Making SQL queries can be very slow depending on the efficiency of your SQL query. If you are going to make a big query I would usually recommend to only use “WHERE .. IN (.., ..)” queries on the primary identifier columns.
  • The SQL query will return the raw data, this will in nearly all cases differ from the output you get using the WRDS web interface. This is due to the fact that the WRDS web interface usually has pre-selected “data quality” options. For example, the Historical CIK-CUSIP link table from the WRDS SEC Analytics Suite excludes “Invalid Link” and “Name Mismatch” observations whereas the SQL query will by default include all observations. This can be easily resolved by some data cleaning after the SQL query but it is important to inspect your data after the query!

Python example code

Import WRDS library

Note, the WRDSlibrary can be installed from the command line using pip install wrds
https://pypi.python.org/pypi/wrds

In [1]:
import wrds
In [2]:
db = wrds.Connection()
Enter your WRDS username [TiesPC]: wrds_user
Enter your password:········
Loading library list...
Done

Get META data

See all available libraries

In [3]:
libraries = db.list_libraries()
libraries[:4]
Out[3]:
['aha_sample', 'ahasamp', 'audit', 'audit_audit_comp']

See all tables in a library

In [4]:
tables = db.list_tables('wrdssec')
tables[:4]
Out[4]:
['wrds_forms_reg', 'wrds_13f_link', 'wrds_13f_summary', 'wrds_forms']

See all column headers in a table

In [5]:
col_headers = db.describe_table(library='wrdssec', table='wrds_nlpsa')
Approximately 45332100 rows in wrdssec.wrds_nlpsa.
In [6]:
col_headers.head()
name nullable type
0 gvkey True VARCHAR(6)
1 cik True VARCHAR(10)
2 fdate True DATE
3 form True VARCHAR(12)
4 coname True VARCHAR(62)

Querying data

There are two ways:

  1. get_table()
  2. raw_sql()

For a table with not too many rows (like some of the linktables) you can use get_table(), otherwise raw_sql()is advisable.

Get first 5 rows using get_table()

In [9]:
data_5rows = db.get_table(library='wrdssec', table='wrds_nlpsa', columns = ['gvkey', 'cik', 'filename', 'finterms_litigious_count'], obs=5)
In [10]:
data_5rows
gvkey cik filename finterms_litigious_count
0 None 0000000003 000000/3/0000950103-94-000082.txt 0.0
1 None 0000000003 000000/3/0000934850-95-001755.txt 0.0
2 None 0000000003 000000/3/0000000003-96-000001.txt 0.0
3 None 0000000003 000000/3/0000934850-97-000209.txt 2.0
4 None 0000000003 000000/3/0000934850-98-001281.txt 0.0

Query using raw_sql()

In [13]:
data_5rows.cik.values
Out[13]:
array(['0000000003', '0000000003', '0000000003', '0000000003', '0000000003'], dtype=object)
In [14]:
sql_query = """
SELECT gvkey,
       cik,
       filename,
       finterms_litigious_count
FROM wrdssec.wrds_nlpsa
WHERE cik IN ('0000000003', '0000000003', '0000000003', '0000000003', '0000000003')
"""
In [15]:
data_query = db.raw_sql(sql_query)
In [16]:
data_query
gvkey cik filename finterms_litigious_count
0 None 0000000003 000000/3/0000950103-94-000082.txt 0.0
1 None 0000000003 000000/3/0000934850-95-001755.txt 0.0
2 None 0000000003 000000/3/0000000003-96-000001.txt 0.0
3 None 0000000003 000000/3/0000934850-97-000209.txt 2.0
4 None 0000000003 000000/3/0000934850-98-001281.txt 0.0
5 None 0000000003 000000/3/0000934850-98-002151.txt 0.0
6 None 0000000003 000000/3/0000934850-98-002871.txt 0.0

 

 

WordPress Cookie Plugin by Real Cookie Banner