Downloading data from the WRDS website is convenient but not the most transparent and replicable as it requires a workflow along the lines of:

Create a list of identifiers using your program (i.e. Python)
Load the identifiers into the WRDS web interface + make your query
Download the resulting data from WRDS into a file
Load the WRDS data file back into your program

The problem is that step 2 and step 3 happen outside of your code. This essentially leaves a gap in the steps required to run your code, which can leave other people (or your future-self) confused as to what you exactly did on the WRDS website to extract the required data.

For a long time WRDS only offered to programmatically interface with WRDS data through SAS files. Non-SAS clients (think Python and R) could therefore only interact with WRDS data through SAS-bindings. This approach technically worked but was definitely a substantial hassle for many applications. Fortunately, WRDS decided to also make all their data available through a series of PostgreSQL databases. This change happened somewhere in June / July of 2017 (at least for the WRDS Python package). In other words, we can now finally directly interface with WRDS data using tools like Python, R, Stata, and MATLAB!

The WRDS support documents extensively describe how interface with WRDS using the different programs. I will include all links below, but for the sake of completeness will discuss a couple of tips and tricks + some Python code examples to illustrate how one can use it.

Links to WRDS documentation¶

Tips and Tricks

The WRDS documentation for a particular program is usually roughly split into two sections: using the WRDS cloud and using your own system. If you have a sufficiently powerful system I recommend to start with using your own system. As such I don’t recommend to go through all of the documentation linearly, but instead to scroll down to the actual code examples.
Making SQL queries can be very slow depending on the efficiency of your SQL query. If you are going to make a big query I would usually recommend to only use “WHERE .. IN (.., ..)” queries on the primary identifier columns.
The SQL query will return the raw data, this will in nearly all cases differ from the output you get using the WRDS web interface. This is due to the fact that the WRDS web interface usually has pre-selected “data quality” options. For example, the Historical CIK-CUSIP link table from the WRDS SEC Analytics Suite excludes “Invalid Link” and “Name Mismatch” observations whereas the SQL query will by default include all observations. This can be easily resolved by some data cleaning after the SQL query but it is important to inspect your data after the query!

Python example code¶

Import WRDS library¶

Note, the WRDSlibrary can be installed from the command line using pip install wrds
https://pypi.python.org/pypi/wrds

In [1]:

import wrds

In [2]:

db = wrds.Connection()

Enter your WRDS username [TiesPC]: wrds_user
Enter your password:········
Loading library list...
Done

Get META data¶

See all available libraries¶

In [3]:

libraries = db.list_libraries()
libraries[:4]

Out[3]:

['aha_sample', 'ahasamp', 'audit', 'audit_audit_comp']

See all tables in a library¶

In [4]:

tables = db.list_tables('wrdssec')
tables[:4]

Out[4]:

['wrds_forms_reg', 'wrds_13f_link', 'wrds_13f_summary', 'wrds_forms']

See all column headers in a table¶

In [5]:

col_headers = db.describe_table(library='wrdssec', table='wrds_nlpsa')

Approximately 45332100 rows in wrdssec.wrds_nlpsa.

In [6]:

col_headers.head()

	name	nullable	type
0	gvkey	True	VARCHAR(6)
1	cik	True	VARCHAR(10)
2	fdate	True	DATE
3	form	True	VARCHAR(12)
4	coname	True	VARCHAR(62)

Querying data¶

There are two ways:

get_table()
raw_sql()

For a table with not too many rows (like some of the linktables) you can use get_table(), otherwise raw_sql()is advisable.

Get first 5 rows using `get_table()`¶

In [9]:

data_5rows = db.get_table(library='wrdssec', table='wrds_nlpsa', columns = ['gvkey', 'cik', 'filename', 'finterms_litigious_count'], obs=5)

In [10]:

data_5rows

	gvkey	cik	filename	finterms_litigious_count
0	None	0000000003	000000/3/0000950103-94-000082.txt	0.0
1	None	0000000003	000000/3/0000934850-95-001755.txt	0.0
2	None	0000000003	000000/3/0000000003-96-000001.txt	0.0
3	None	0000000003	000000/3/0000934850-97-000209.txt	2.0
4	None	0000000003	000000/3/0000934850-98-001281.txt	0.0

Query using `raw_sql()`¶

In [13]:

data_5rows.cik.values

Out[13]:

array(['0000000003', '0000000003', '0000000003', '0000000003', '0000000003'], dtype=object)

In [14]:

sql_query = """
SELECT gvkey,
       cik,
       filename,
       finterms_litigious_count
FROM wrdssec.wrds_nlpsa
WHERE cik IN ('0000000003', '0000000003', '0000000003', '0000000003', '0000000003')
"""

In [15]:

data_query = db.raw_sql(sql_query)

In [16]:

data_query

	gvkey	cik	filename	finterms_litigious_count
0	None	0000000003	000000/3/0000950103-94-000082.txt	0.0
1	None	0000000003	000000/3/0000934850-95-001755.txt	0.0
2	None	0000000003	000000/3/0000000003-96-000001.txt	0.0
3	None	0000000003	000000/3/0000934850-97-000209.txt	2.0
4	None	0000000003	000000/3/0000934850-98-001281.txt	0.0
5	None	0000000003	000000/3/0000934850-98-002151.txt	0.0
6	None	0000000003	000000/3/0000934850-98-002871.txt	0.0

Retrieving data from WRDS directly using Python, R, and Stata

Links to WRDS documentation¶

Tips and Tricks

Python example code¶

Import WRDS library¶

Get META data¶

See all available libraries¶

See all tables in a library¶

See all column headers in a table¶

Querying data¶

Get first 5 rows using `get_table()`¶

Query using `raw_sql()`¶

About Ties de Kok

Related posts

Retrieving data from WRDS directly using Python, R, and Stata

Links to WRDS documentation¶

Tips and Tricks

Python example code¶

Import WRDS library¶

Get META data¶

See all available libraries¶

See all tables in a library¶

See all column headers in a table¶

Querying data¶

Get first 5 rows using get_table()¶

Query using raw_sql()¶

About Ties de Kok

Related posts

Get first 5 rows using `get_table()`¶

Query using `raw_sql()`¶