Development¶
How to contribute¶
You can contribute to improve Django Dynamic Scraper in many ways:
- If you stumbled over a bug or have suggestions for an improvement or a feature addition, report an issue on the GitHub page with a good description.
- If you have already fixed the bug or added the feature in the DDS code, you can also make a pull request on GitHub. While I can’t promise that every request will be merged into the DDS source, I will look at each request closely and integrate it if I feel that it’s a good fit!
- Since this documentation is also available in the GitHub repository of DDS, you can also make pull requests for documentation!
Here are some topics for which suggestions would be especially interesting:
- If you worked your way through the documentation and you were completely lost at some point, it would be helpful to know where that was.
- If there are unnecessary limitations of the Scrapy functionality in the DDS source which could be eliminated without adding complexity to the way you can use DDS, that would be very interesting to know.
And finally: please let me know about how you are using Django Dynamic Scraper!
Running the test suite¶
Overview¶
Tests for DDS are organized in a separate tests Django project in the root folder of the repository. Due to restrictions of Scrapy’s networking engine Twisted, DDS test cases directly testing scrapers have to be run as new processes and can’t be executed sequentially via python manage.py test.
To run the tests, first go to the tests directory and start a test server with:
./testserver.sh
Then you can run the test suite with:
./run_tests.sh
Note
If you are testing DDS for Django/Scrapy version compatibility: there might be 2-3 tests that generally don’t work properly, so if just a handful of tests don’t pass, have a closer look at the test output.
Django test apps¶
There are currently two Django apps containing tests. The basic app tests scraper-unrelated functionality like correct processor output or scheduling time calculations. These tests can be run on a per-file level:
python manage.py test basic.processors_test.ProcessorsTest
The scraper app tests scraper-related functionality. Tests can either be run via the shell script (see above) or on a per-test-case level like this:
python manage.py test scraper.scraper_run_test.ScraperRunTest.test_scraper #Django 1.6+
python manage.py test scraper.ScraperRunTest.test_scraper #Django up to 1.5
Have a look at the run_tests.sh shell script for more examples!
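Since scraper tests have to run as separate processes, such a wrapper script presumably just invokes the individual test cases one after another via manage.py. A minimal sketch of this pattern (the test case selection here is only illustrative, the real run_tests.sh may call more or other cases):
#!/bin/bash
# assumes the test server started via testserver.sh is already running (see above)
python manage.py test basic.processors_test.ProcessorsTest
python manage.py test scraper.scraper_run_test.ScraperRunTest.test_scraper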
Running ScrapyJS/Splash JS rendering tests¶
Unit tests testing ScrapyJS/Splash Javascript rendering functionality need a working ScrapyJS/Splash (Docker) installation and are therefore run separately with:
./run_js_tests.sh
Test cases are located in scraper.scraper_js_run_test.ScraperJSRunTest. Note that SPLASH_URL in scraper.settings.base_settings.py has to be adapted to your local installation to get this running!
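For a local Docker installation with the default Splash port, the setting would typically look like the following line (adjust host and port to your own setup; with docker-machine/boot2docker use the VM’s IP instead of localhost):
SPLASH_URL = 'http://localhost:8050'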
The Docker container can be run with:
docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 -d scrapinghub/splash
Note
For rendering websites served on localhost from within Docker/Splash, you can connect to localhost outside the Docker container via http://10.0.2.2 (see e.g. Stack Overflow).
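As an illustration (hypothetical setup: the page to be scraped is served by a Django dev server on the host at port 8000), a main page URL used from inside the container would then be:
http://10.0.2.2:8000/articles/   # instead of http://localhost:8000/articles/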
Release Notes¶
Changes in version 0.8.0-beta (2015-09-22)
- New request page types for main page and detail pages of scrapers (see: Adding corresponding request page types):
- Cleaner association of request options like content or request type to main or detail pages (see: Advanced Request Options)
- More flexibility in using different request options for main and detail pages (rendering Javascript on main but not on detail pages, different HTTP header or body values,...)
- Allowance of several detail page URLs per scraper
- Possibility for not saving the detail page URL used for scraping by unchecking the corresponding new ScrapedObjClass attribute save_to_db
- ATTENTION! This release comes with heavy internal changes regarding both DB structure and scraping logic. Unit tests are running through, but there might be untested edge cases. If you want to use the new functionality in a production environment please do this with extra care. You also might want to wait for 2-3 weeks after release and/or for a following 0.8.1 release (not sure if necessary yet). If you upgrade it is HIGHLY RECOMMENDED TO BACKUP YOUR PROJECT AND YOUR DB before!
- Replaced Scrapy Spider with CrawlSpider class being the basis for DjangoBaseSpider, allowing for more flexibility when extending
- Custom migration for automatically creating new RequestPageType objects for existing scrapers
- Unit tests for new functionality
- Partly restructured documentation, separate Installation section
- New migrations 0008, 0009, 0010, run Django migration command
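The Django migration command referred to here and in several entries below is the standard migrate call for the dynamic_scraper app (also spelled out in the 0.4.1-beta entry):
python manage.py migrate dynamic_scraper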
Changes in version 0.7.3-beta (2015-08-10)
- New attribute dont_filter for Scraper request options (see: Advanced Request Options), necessary for some scenarios where Scrapy falsely marks (and omits) requests as being duplicate (e.g. when scraping uniform URLs together with custom HTTP header pagination)
- Fixed bug preventing processing of JSON with non-string data types (e.g. Number) for scraped attributes, values are now automatically converted to String
- New migration 0007, run Django migration command
Changes in version 0.7.2-beta (2015-08-06)
- Added new method attribute to Scraper not binding HTTP method choice (GET/POST) so strictly to choice of request_type (allowing e.g. more flexible POST requests), see: Advanced Request Options
- Added new body attribute to Scraper allowing for sending custom request HTTP message body data, see: Advanced Request Options
- Allowing pagination for headers, body attributes
- Allowing ScrapedObjectClass definitions in Django admin with no attributes defined as ID field (omits double checking process when used)
- New migration 0006, run Django migration command
Changes in version 0.7.1-beta (2015-08-03)
- Fixed severe bug preventing pagination for cookies and form_data from working properly
- Added a new section in the docs for Advanced Request Options
- Unit tests for some scraper request option selections
Changes in version 0.7.0-beta (2015-07-31)
- Adding additional HTTP header attributes to scrapers in Django admin
- Cookie support for scrapers
- Passing Scraper specific Scrapy meta data
- Support for form requests, passing form data within requests
- Pagination support for cookies, form data
- New migration 0005, run Django migration command
- All changes visible in Scraper form of Django admin
- ATTENTION! While unit tests for existing functionality are all passing, new functionality is not heavily tested yet due to problems in creating test scenarios. If you want to use the new functionality in a production environment please test with extra care. You also might want to wait for 2-3 weeks after release and/or for a following 0.7.1 release (not sure if necessary yet)
- Please report problems/bugs on GitHub.
Changes in version 0.6.0-beta (2015-07-14)
- Replaced implicit and static ID concept of mandatory DETAIL_PAGE_URL type attribute serving as ID with a more flexible concept of explicitly setting ID Fields for ScrapedObjClass in Django admin (see: Defining the object to be scraped)
- New attribute id_field for ScrapedObjClass, please run Django migration command (migration 0004)
- DETAIL_PAGE_URL type attribute not necessary any more when defining the scraped object class, allowing for more scraping use cases (classic and simple/flat datasets not referencing a certain detail page)
- Single DETAIL_PAGE_URL type ID Field still necessary for using DDS checker functionality (see: Defining item checkers)
- Additional form checks for ScrapedObjClass definition in Django admin
Changes in version 0.5.2-beta (2015-06-18)
- Two new processors ts_to_date and ts_to_time to extract local date/time from unix timestamp string (see: Processors)
Changes in version 0.5.1-beta (2015-06-17)
- Make sure that Javascript rendering is only activated for pages with HTML content type
Changes in version 0.5.0-beta (2015-06-10)
- Support for creating JSON/JSONPath scrapers for scraping JSON encoded pages (see: Scraping JSON content)
- Added new separate content type choice for detail pages and checkers (e.g. main page in HTML, detail page in JSON)
- New Scraper model attribute detail_page_content_type, please run Django migration command (migration 0003)
- New library dependency python-jsonpath-rw 1.4+ (see Requirements)
- Updated unit tests to support/test JSON scraping
Changes in version 0.4.2-beta (2015-06-05)
- Possibility to customize Splash args with new setting DSCRAPER_SPLASH_ARGS (see: Setting up ScrapyJS/Splash (Optional))
Changes in version 0.4.1-beta (2015-06-04)
- Support for Javascript rendering of scraped pages via ScrapyJS/Splash
- Feature is optional and needs a working ScrapyJS/Splash deployment, see Requirements and Setting up ScrapyJS/Splash (Optional)
- New attribute render_javascript for Scraper model, run python manage.py migrate dynamic_scraper to apply (migration 0002)
- New unit tests for Javascript rendering (see: Running ScrapyJS/Splash JS rendering tests)
Changes in version 0.4.0-beta (2015-06-02)
- Support for Django 1.7/1.8 and Scrapy 0.22/0.24. Earlier versions not supported any more from this release on; if you need another configuration have a look at the DDS 0.3.x branch (new features won’t be back-ported though) (see Release Compatibility Table)
- Switched to Django migrations, removed South dependency
- Updated core library to work with Django 1.7/1.8 (Django 1.6 and older not working any more)
- Replaced deprecated calls logged when run under Scrapy 0.24 (Scrapy 0.20 and older not working any more)
- Things to consider when updating Scrapy: new ITEM_PIPELINES dict format, standalone scrapyd with changed scrapy.cfg settings and new deployment procedure (see: Scrapy Configuration)
- Adapted example_project and tests Django projects to work with the updated dependencies
- Updated open_news.json example project fixture
- Changed DDS status to Beta
Changes in version 0.3.14-alpha (2015-05-30)
- Pure documentation update release to get updated Scrapy 0.20/0.22/0.24 compatibility info in the docs (see: Release Compatibility Table)
Changes in version 0.3.13-alpha (2015-05-29)
- Adapted test suite to pass through under Scrapy 0.18 (tests don’t work with Scrapy 0.16 any more)
- Added Scrapy 0.18 to release compatibility table (see: Release Compatibility Table)
Changes in version 0.3.12-alpha (2015-05-28)
- Added new release compatibility overview table to docs (see: Release Compatibility Table)
- Adapted run_tests.sh script to run with Django 1.6
- Tested Django 1.5, Django 1.6 for compatibility with DDS v.0.3.x
- Updated title xpath in fixture for Wikinews example scraper
Changes in version 0.3.11-alpha (2015-04-20)
- Added only-active and --report-only-errors options to run_checker_tests management command (see: Run checker tests)
Changes in version 0.3.10-alpha (2015-03-17)
- Added missing management command for checker functionality tests to distribution (see: Run checker tests)
Changes in version 0.3.9-alpha (2015-01-23)
- Added new setting DSCRAPER_IMAGES_STORE_FORMAT for more flexibility with storing original and/or thumbnail images (see Scraping images/screenshots)
Changes in version 0.3.8-alpha (2014-10-14)
- Added ability for duration processor to break down and parse second values greater than one hour in total (>= 3600 seconds) (see: Processors)
Changes in version 0.3.7-alpha (2014-03-20)
- Improved run_checker_tests management command with --send-admin-mail flag for usage of command in cronjob (see: Run checker tests)
Changes in version 0.3.6-alpha (2014-03-19)
- Added new admin action clone_scrapers to get a functional copy of scrapers easily
Changes in version 0.3.5-alpha (2013-11-02)
- Added super init call to DjangoBaseSpider init method to invoke the init method of the Scrapy BaseSpider class (see Pull Request #32)
Changes in version 0.3.4-alpha (2013-10-18)
- Fixed bug displaying wrong message in checker tests
- Removed run_checker_tests celery task (which wasn’t working anyway) and replaced it with a simple Django management command run_checker_tests to run checker tests for all scrapers
Changes in version 0.3.3-alpha (2013-10-16)
- Making status list editable in Scraper admin overview page for easier status change for many scrapers at once
- Possibility to define x_path checkers with blank checker_x_path_result, the checker then succeeds if elements are found on the page (before this led to an error message)
Changes in version 0.3.2-alpha (2013-09-28)
- Fixed the exception when scheduler string was processed (see Pull Request #27)
- Allowed Checker Reference URLs to be longer than the default 200 characters (DB Migration 0004, see Pull Request #29)
- Changed __unicode__ method for SchedulerRuntime to prevent TypeError (see Google Groups Discussion)
- Refer to ID instead of PK (see commit in nextlanding repo)
Changes in version 0.3.1-alpha (2013-09-03)
- Possibility to add keyword arguments to spider and checker task method to specify which reference objects to use for spider/checker runs (see: Defining your tasks)
Changes in version 0.3-alpha (2013-01-15)
- Main purpose of release is to upgrade to new libraries (Attention: some code changes necessary!)
- Scrapy 0.16: The DjangoItem class used by DDS moved from scrapy.contrib_exp.djangoitem to scrapy.contrib.djangoitem. Please update your Django model class accordingly (see: Creating your Django models).
- Scrapy 0.16: BOT_VERSION setting no longer used in Scrapy/DDS settings.py file (see: Setting up Scrapy)
- Scrapy 0.16: Some minor import changes for DDS to get rid of deprecated settings import
- Django 1.5: Changed Django settings configuration, please update your Scrapy/DDS settings.py file (see: Setting up Scrapy)
- django-celery 3.x: Simpler installation, updated docs accordingly (see: Installing/configuring django-celery for DDS)
- New log output about which Django settings are used when running a scraper
Changes in version 0.2-alpha (2012-06-22)
- Substantial API and DB layout changes compared to version 0.1
- Introduction of South for data migrations
Changes in version 0.1-pre-alpha (2011-12-20)
- Initial version
Roadmap¶
pre-alpha
Django Dynamic Scraper’s pre-alpha phase was meant for people interested in having a first look at the library and giving some feedback on whether things generally made sense the way they were worked out/conceptually designed, or whether a different approach to implementing some parts of the software would have made more sense.
alpha (current)
DDS is currently in alpha stage, which means that the library has proven itself in (at least) one production environment and can be (cautiously) used for production purposes. However, being still very early in development, there are still API and DB changes to come for improving the lib in different ways. The alpha stage will be used for getting most parts of the API relatively stable and eliminating the most urgent bugs/flaws from the software.
beta
In the beta phase the API of the software should be relatively stable, though occasional changes will still be possible if necessary. The beta stage should be the first period where it is safe to use the software in production and to rely on its stability. Then the software should remain in beta for some time.
Version 1.0
Version 1.0 will be reached when the software has matured in the beta phase and at least 10 projects are using DDS productively for different purposes.