2013-10-07

Scrapy框架学习笔记2—— Scrapy与Django结合

前面也介绍过了Scrapy与Django的设计思想非常相似，因此这个两个结合也是比较容易的。
以下方法在Scrapy 0.18与Django 1.5下面测试是可以用的。

1.首先设置Django的运行环境

在settings.py中添加如下代码：

def setup_django_environment(path):
    import imp, os, sys
    from django.core.management import setup_environ
    m = imp.load_module('settings', *imp.find_module('settings', [path]))
    setup_environ(m)
    sys.path.append(os.path.abspath(os.path.join(path, os.path.pardir)))

setup_django_environment("/django/project/path")

注意：如果你的Django项目是用的sqlite数据库的话，那就需要设置为绝对路径，不能使用相对路径。

2.创建django item

首先在Django项目代码中创建一个Django的model，例如:

from django.db import models
class ScrapyModel(models.Model):
    title = models.CharField(max_length=200)
    link = models.CharField(max_length=200)
    desc = models.TextField()

然后在Scrapy项目中创建一个新的Item，只不过这次我们不再是继承自scrapy.item.Item，而是scrapy.contrib.djangoitem.DjangoItem:

1 2	class Test1DjItem(DjangoItem): django_model = ScrapyModel

用法与原来的Item相同，只是最后要执行一个save函数来调用django的save方法将数据存入数据库。

2013-10-07

Python栏目

Scrapy框架学习笔记1——概括

Scrapy是一个python的web爬虫框架，其设计思想与Django非常相似。如果事先用过Django的话那么理解Scrapy应该不难。

0.安装Scrapy

当前Scrapy的稳定版本是0.18
使用pip进行安装： pip install Scrapy
使用easy_install进行安装： easy_install Scrapy

初次使用可以按照它的官方教程操作，先熟悉一下：
http://doc.scrapy.org/en/0.18/intro/tutorial.html

1.一个项目的基本流程

创建新项目 scrapy startproject <name>

然后在<name>/spiders目录下新建一个python文件。
编写一个新的class继承自 scrapy.spider.BaseSpider，并且需要设置如下属性：
name: 该爬虫的名字，不能与其它的相同 start_urls: 开始的url入口
parse(): 对从start_urls的获取的内容进行处理的函数，需要一个Response参数 allowed_domains: 一个列表，表示该爬虫允许爬取的网站域名。可选

爬虫运行流程：
首先调用 start_requests()方法访问start_urls的链接（默认的），然后由parse回调方法对请求的响应进行处理。 在回调中处理Response，并返回Item或者Request，再或者是由这两种对象组成的一个可迭代的对象。
Request对象对应的callback可以与parse相同，也可以不同。

最后运行这个爬虫脚本： scrapy crawl name

也可以直接在终端运行一个选择器： scrapy shell <url>

2.使用Item

scrapy.item.Item的调用接口类似于python的dict，Item包含多个scrapy.item.Field。  
这跟django的Model与Field有点相似。  
Item通常是在Spider的parse方法里使用，它用来保存解析到的数据。

3.使用Item Pipeline

在settings.py中设置ITEM_PIPELINES，其默认为[]，与django的MIDDLEWARE_CLASSES等相似。
从Spider的parse返回的Item数据将依次被ITEM_PIPELINES列表中的Pipeline类处理。

一个Item Pipeline类必须实现以下方法：

process_item(item, spider)

为每个item pipeline组件调用，并且需要返回一个scrapy.item.Item实例对象或者抛出一个scrapy.exceptions.DropItem异常。
当抛出异常后该item将不会被之后的pipeline处理。
参数:
item (Item object) – 由parse方法返回的Item对象
spider (BaseSpider object) – 抓取到这个Item对象对应的爬虫对象

也可额外的实现以下两个方法：
open_spider(spider)

当爬虫打开之后被调用。
参数: spider (BaseSpider object) – 已经运行的爬虫

close_spider(spider)

当爬虫关闭之后被调用。
参数: spider (BaseSpider object) – 已经关闭的爬虫

4.保存数据

想要保存抓取到的数据，最简单的方法是使用Feed exports。参考官方文档：http://doc.scrapy.org/en/0.18/topics/feed-exports.html#topics-feed-exports

例如将抓取的数据导出为json: scrapy crawl dmoz -o items.json -t json

对于小项目用这种方法也足够了。如果是比较复杂的数据的话可能就需要编写一个Item Pipeline进行处理了。

2013-08-24

开源软件

分享几款vim风格的浏览器

作为一个了vim用户，希望所有的操作都能够以vim的按键方式完成，上网浏览网页也不例外。于是乎就找了几款vim按键绑定的浏览器。

0.Firefox + Vimperator / Chrome + Vimium
如果是用的Firefox或者Chrome浏览器的话直接安装相应的插件即可达到想要的效果。
Firefox的是Vimperator；Chrome的是Vimium。

但是感觉Firefox和Chrome有时又太占资源了，我想到一个轻量级的浏览器。

1.w3m
如果你是一个终端控的话，w3m貌似还不错。现在常见的Linux发行版的软件源中应该都包含了w3m，直接通过软件仓库安装就行了。
只不过w3m太简洁了，显示不了CSS样式，执行不了js代码，甚至是显示图片都比较麻烦。

2.uzbl
uzbl是一款基于GTK+Webkit的浏览器，只是它的按键与原生的vim相差比较大，我刚开始操作不太适应。
它的配置文件在$HOME/.config/uzbl目录下，如果把它配置好了的话那使用起来应该也会很不错的。
uzbl默认是不支持多标签的，如果要想打开多个标签请运行uzbl-tabbed程序。
下载地址：http://www.uzbl.org/

linuxdeepin用户可以执行命令 sudo apt-get install uzbl来安装。

3.dwb
dwb是一款基于gtk和webkit的轻量级的浏览器。它的按键与Firefox的Vimperator插件较为相似。
我个人觉得dwb是这几款浏览器中最为方便的，它配置比较简单，而且提供了一个web的配置页面。如图：

下载地址：http://portix.bitbucket.org/dwb/

4.luakit
luakit也是基于gtk和webkit的，只不过它是用lua语言开发的。它的配置文件也是lua脚本，如果想到修改配置的话可能需要懂点lua语言。
我感觉luakit的按键与vim最相近，而且它还提供了页面调试功能，这对做web开发的人来说比较方便。

下载地址：http://mason-larobina.github.io/luakit/

linuxdeepin用户可以执行命令 sudo apt-get install luakit来安装。

2013-08-11

开源软件

分享两个vim的git插件

1.vim-git

我只是简单体验发一下，个人觉得这个就是用vim的命令把git的命令包装了一下，还不如直接在终端下操作方便。

常用命令
:GitAdd
:GitCommit
:GitStatus
:GitLog
:GitCheckout
:GitDiff
:GitPull
:GitPush

快捷键
<Leader>gd 等同于 :GitDiff
<Leader>gD 等同于 :GitDiff —cached
<Leader>gs 等同于 :GitStatus
<Leader>gl 等同于 :GitLog
<Leader>ga 等同于 :GitAdd
<Leader>gA 等同于 :GitAdd <cfile>
<Leader>gc 等同于 :GitCommit

下载地址：https://github.com/motemen/git-vim.git

2.vim-fugitive

感觉这个比较强大，尤其是配合vimdiff来查看文件的发动非常方便。
它的说明文档介绍到如果你的vim版本低于7.2，那么建议同时也把vim-git安装上。

下载地址：https://github.com/tpope/vim-fugitive.git

2013-07-20

数据结构和算法

md5的一个碰撞例子

md5作为计算机安全领域广泛使用的一种散列函数，被用于检测信息是否完整一致。

然而在2004年8月17日的美国加州圣巴巴拉，正在召开的国际密码学会议（Crypto’2004）上来自山东大学的王小云教授做了破译MD5、HAVAL-128、 MD4和RIPEMD算法的报告。她公布了MD系列算法的破解结果。

各位好奇的可以在网上搜索王小云老师的那篇论文，反正我是看不懂的。不过倒是找到了一个md5碰撞的例子，于是就拿来分享一下。

比如有两个文件test1、test2，文件的十六进制内容分别如下：

% xxd test1            
0000000: 0e30 6561 559a a787 d00b c6f7 0bbd fe34  .0eaU..........4
0000010: 04cf 0365 9e70 4f85 34c0 0ffb 659c 4c87  ...e.pO.4...e.L.
0000020: 40cc 942f eb2d a115 a3f4 155c bb86 0749  @../.-.....\...I
0000030: 7386 656d 7d1f 34a4 2059 d78f 5a8d d1ef  s.em}.4. Y..Z...

% xxd test2
0000000: 0e30 6561 559a a787 d00b c6f7 0bbd fe34  .0eaU..........4
0000010: 04cf 0365 9e74 4f85 34c0 0ffb 659c 4c87  ...e.tO.4...e.L.
0000020: 40cc 942f eb2d a115 a3f4 15dc bb86 0749  @../.-.........I
0000030: 7386 656d 7d1f 34a4 2059 d78f 5a8d d1ef  s.em}.4. Y..Z...

如果想得到文件的原始内容，可以把上面的输出信息保存为一个文本文件（如test.txt），然后再执行命令 xxd -r test1.txt > test 即可。
从上面可以看到这两个文件的内容明显是不一样的，那么现在就来计算一下这两个文件的md5值吧。

% md5sum test1
cee9a457e790cf20d4bdaa6d69f01e41  test1
% md5sum test2
cee9a457e790cf20d4bdaa6d69f01e41  test2

从计算结果可以看到，虽然这两个文件内容不同，但是它们的md5值却是相同的。这个就是所谓的碰撞。

2013-07-14

数据结构和算法

Python中的字符串匹配算法分析

在Python在str对象的find、index、replace等操作都是基于字符串匹配的。通过阅读Python的源代码可知，在Python中的字符串匹配算法是基于Boyer-Moore、Horspool和Sunday的混合。

关于它的详细介绍可以访问这个网页： http://effbot.org/zone/stringlib.htm

用Python代码实现如下：

#!/usr/bin/env python
#-*- coding:utf-8 -*-


def make_delta1(p):
    ALPHABET_LEN = 256
    i = 0
    patternlen = len(p)
    delta1 = [0] * ALPHABET_LEN

    while i < ALPHABET_LEN:
        delta1[i] = patternlen
        i += 1

    i = 0
    while i < patternlen-1:
        delta1[ord(p[i])] = patternlen-1 - i
        i += 1
    return delta1


def find(s, p):
    # find first occurrence of p in s
    n = len(s)
    m = len(p)
    delta1 = make_delta1(p)
    skip = delta1[ord(p[m-1])]
    i = 0
    while i <= n-m:
        if s[i+m-1] == p[m-1]:  # (boyer-moore)
            # potential match
            if s[i:i+m-1] == p[:m-1]:
                return i
            if s[i+m] not in p:
                i = i + m + 1   # (sunday)
            else:
                i = i + skip    # (horspool)
        else:
            # skip
            if s[i+m] not in p:
                i = i + m + 1   # (sunday)
            else:
                i = i + 1
    return -1                   # not found

if __name__ == '__main__':
    print find("HERE IS A SILMPLE EXAMPLE", "EXAMPLE")

2013-07-14

数据结构和算法

字符串匹配算法（五）——Horspool算法

Horspool算法是Boyer-Moore算法的一个简化版，全名叫做Boyer-Moore-Horspool算法。

Horspool算法的基本思想是将文本串text中匹配窗口的最后一个字符跟模式串pattern中的最后一个字符比较。如果相等，继续从后向前对主串和模式串进行比较，直到完全相等或者在某个字符处不匹配为止。如果不匹配，则根据文本串匹配窗口中的最后一个字符β在模式串中的下一个出现位置将窗口向右移动；若字符β没有在模式串中出现，则直接将整个模式串滑过这一位。

同样的举个例子来说明一下。假定文本串text为：”HERE IS A SIMPLE EXAMPLE”，长度为n；模式串pattern为：”EXAMPLE”，长度为m；现在要在text中搜索看看是否包含pattern。

1).


HERE IS A SIMPLE EXAMPLE
EXAMPLE

‘S’与’E’匹配失败，并且’S’没有在pattern中出现，所以直接将pattern滑过’S’这一位。

2).


HERE IS A SIMPLE EXAMPLE
       EXAMPLE

这时’P’与’E’匹配失败，但是’P’在模式串pattern中出现了，所以把pattern向右移，使得text中的’P’与pattern中的’P’对齐。

3).


HERE IS A SIMPLE EXAMPLE
         EXAMPLE

然后重复执行之前的操作。

用C语言实现如下。

#include <stdio.h>
#include <string.h>
#include <limits.h>
 
/* Returns a pointer to the first occurrence of "needle"
 * within "haystack", or NULL if not found. Works like
 * memmem().
 */
 
/* Note: In this example needle is a C string. The ending
 * 0x00 will be cut off, so you could call this example with
 * boyermoore_horspool_memmem(haystack, hlen, "abc", sizeof("abc"))
 */
const unsigned char *
boyermoore_horspool_memmem(const unsigned char* haystack, size_t hlen,
                           const unsigned char* needle,   size_t nlen)
{
    int index = 0;
    size_t scan = 0;
    size_t bad_char_skip[UCHAR_MAX + 1]; /* Officially called:
                                          * bad character shift */
 
    /* Sanity checks on the parameters */
    if (nlen <= 0 || !haystack || !needle)
        return NULL;
 
    /* ---- Preprocess ---- */
    /* Initialize the table to default value */
    /* When a character is encountered that does not occur
     * in the needle, we can safely skip ahead for the whole
     * length of the needle.
     */
    for (scan = 0; scan <= UCHAR_MAX; scan = scan + 1)
        bad_char_skip[scan] = nlen;
 
    /* C arrays have the first byte at [0], therefore:
     * [nlen - 1] is the last byte of the array. */
    size_t last = nlen - 1;
 
    /* Then populate it with the analysis of the needle */
    for (scan = 0; scan < last; scan = scan + 1)
        bad_char_skip[needle[scan]] = last - scan;
 
    /* ---- Do the matching ---- */
 
    /* Search the haystack, while the needle can still be within it. */
    while (hlen >= nlen)
    {
        /* scan from the end of the needle */
        for (scan = last; haystack[scan] == needle[scan]; scan = scan - 1)
            if (scan == 0) /* If the first byte matches, we've found it. */
            {
                printf("index: %d\n", index-1);
                return haystack;
            }
 
        /* otherwise, we need to skip some bytes and start again.
           Note that here we are getting the skip value based on the last byte
           of needle, no matter where we didn't match. So if needle is: "abcd"
           then we are skipping based on 'd' and that value will be 4, and
           for "abcdd" we again skip on 'd' but the value will be only 1.
           The alternative of pretending that the mismatched character was
           the last character is slower in the normal case (E.g. finding
           "abcd" in "...azcd..." gives 4 by using 'd' but only
           4-2==2 using 'z'. */
        hlen     -= bad_char_skip[haystack[last]];
        index += bad_char_skip[haystack[last]];
        haystack += bad_char_skip[haystack[last]];
    }
 
    return NULL;
}


int main(int argc, char *argv[])
{
    char text[] = "HERE IS A SILMPLE EXAMPLE";
    char pattern[] = "EXAMPLE";
    char *result = boyermoore_horspool_memmem(text, strlen(text), pattern, strlen(pattern));
    if (result)
    {
        printf("result: %s\n", result);
    } else {
        printf("failed!\n");
    }
    return 0;
}

2013-07-14

数据结构和算法

字符串匹配算法（四）——Boyer-Moore算法

1977年，德克萨斯大学的Robert S. Boyer教授和J Strother Moore教授提出了另一种在O(n)时间复杂度内，完成字符串匹配的算法，其在绝大多数场合的性能表现，比KMP算法还要出色。

Boyer-Moore算法不仅高效，而且构思巧妙，文本编辑器的”查找”功能（Ctrl+F），大多采用的是该算法。

Boyer-Moore算法基本原理是从右向左扫描pattern；与KMP算法类似当遇到不匹配的字符时它也不需要对文本串text进行回溯，而是通过预先计算的Bad-character(坏字符)跳转表以及Good-suffix(好后缀)跳转表进行寻找最大的跳转长度。

下面以Moore教授的例子来讲解一下。这个例子的原始地址： http://www.cs.utexas.edu/~moore/best-ideas/string-searching/fstrpos-example.html

假定文本串text为：”HERE IS A SIMPLE EXAMPLE”，长度为n；模式串pattern为：”EXAMPLE”，长度为m；现在要在text中搜索看看是否包含pattern。

1).

HERE IS A SIMPLE EXAMPLE
EXAMPLE

首先text与pattern左边对齐，从右边开始进行比较。这个方法比较巧妙，因为如果最右边的字符不匹配，那么就只要比较一次就可以知道前7个字符肯定是不相同的。 S与E不匹配，这时'S'被称为Bad-character。并且'S'不包含在pattern之中，这样就可以把pattern直接移动到text的'S'的后一位。 2).

HERE IS A SIMPLE EXAMPLE
       EXAMPLE

仍然是从右边开始比较，发现'P'与'E'不匹配，因此'P'是Bad-character。但是'P'包含中pattern中，因此将pattern后移2位，把两个'P'对齐。注意这里后移时需要用到Bad-character跳转表，这个在后面再介绍。 3).

HERE IS A SIMPLE EXAMPLE
         EXAMPLE

再从右边开始比较，’E’与’E’相匹配，然后再比较下一个字符。

4).

HERE IS A SIMPLE EXAMPLE
         EXAMPLE

几次比较之后得到这样的结果，'A'与'I'不匹配，因此'I'是Bad-character。但是'MPLE'与'MPLE'是匹配的。我们尾部匹配的字符串称为Good-suffix(好后缀)。 5). 根据Bad-character的跳转方法，可以将pattern后移3位，变成如下形式。

HERE IS A SIMPLE EXAMPLE
            EXAMPLE

但是我们可以有更好的方法。这里’MPLE’为Good-suffix，根据Good-suffix跳转方法，将pattern后移6位，变成如下形式。同样的Good-suffix跳转表也将在后面介绍。

HERE IS A SIMPLE EXAMPLE
               EXAMPLE

6).
‘P’与’E’不匹配，因此’P’为Bad-character。再将pattern后移2位，最后变成：

HERE IS A SIMPLE EXAMPLE
                 EXAMPLE

至此整个字符串的匹配就完成了。

HERE IS A SIMPLE EXAMPLE
                 EXAMPLE

接下来介绍一下Bad-character和Good-suffix的跳转方法。
(1).对于Bad-character后移位数计算公式为：
delta1(x) = m ;x != pattern[j] (1 <= j <= m),即坏字符x中pattern中未出现
delta1(x) = m - max{k|pattern[k]=x, 1 <=k<=m} ;其他情况

例如上面的第2步时，’P’与’E’不匹配，因此’P’为Bad-character，在pattern中上次出现的位置为5,因此后移位数就是 7 - 5 = 2。

(2).对于Good-suffix后移位数计算方法为：
1.pattern中间的某一子字符串pattern[j-s+1:m-s] == pattern[j+1:m]，可将pattern右移s位；
2.pattern已比较部分pattern[j+1:m]的后缀pattern[s+1:m]与pattern的前缀pattern[1:m-s]相同，可将pattern右移s位。
满足上面情况的s的最小值为最佳右移距离。
delta2(j) = min{s|(pattern[j+1:m]=pattern[j-s+1:m-s])&&(pattern[j]!=pattern[j-s])(j>s),pattern[s+1:m]=pattern1:m-s}

例如上面的第5步，满足的是情况2.pattern[6+1:7] == pattern[1:7-6] => pattern[7] == pattern[1]，即是’E’ == ‘E’，所以s为6。故右移6位。

在匹配过程中，取delta1和delta2中较大的那一个。

用C语言实现如下：

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
 
#define ALPHABET_LEN 256
#define NOT_FOUND patlen
#define max(a, b) ((a < b) ? b : a)
 
// delta1 table: delta1[c] contains the distance between the last
// character of pat and the rightmost occurence of c in pat.
// If c does not occur in pat, then delta1[c] = patlen.
// If c is at string[i] and c != pat[patlen-1], we can
// safely shift i over by delta1[c], which is the minimum distance
// needed to shift pat forward to get string[i] lined up 
// with some character in pat.
// this algorithm runs in alphabet_len+patlen time.
void make_delta1(int *delta1, uint8_t *pat, int32_t patlen) {
    int i;
    for (i=0; i < ALPHABET_LEN; i++) {
        delta1[i] = NOT_FOUND;
    }
    for (i=0; i < patlen-1; i++) {
        delta1[pat[i]] = patlen-1 - i;
    }
}
 
// true if the suffix of word starting from word[pos] is a prefix 
// of word
int is_prefix(uint8_t *word, int wordlen, int pos) {
    int i;
    int suffixlen = wordlen - pos;
    // could also use the strncmp() library function here
    for (i = 0; i < suffixlen; i++) {
        if (word[i] != word[pos+i]) {
            return 0;
        }
    }
    return 1;
}
 
// length of the longest suffix of word ending on word[pos].
// suffix_length("dddbcabc", 8, 4) = 2
int suffix_length(uint8_t *word, int wordlen, int pos) {
    int i;
    // increment suffix length i to the first mismatch or beginning
    // of the word
    for (i = 0; (word[pos-i] == word[wordlen-1-i]) && (i < pos); i++);
    return i;
}
 
// delta2 table: given a mismatch at pat[pos], we want to align 
// with the next possible full match could be based on what we
// know about pat[pos+1] to pat[patlen-1].
//
// In case 1:
// pat[pos+1] to pat[patlen-1] does not occur elsewhere in pat,
// the next plausible match starts at or after the mismatch.
// If, within the substring pat[pos+1 .. patlen-1], lies a prefix
// of pat, the next plausible match is here (if there are multiple
// prefixes in the substring, pick the longest). Otherwise, the
// next plausible match starts past the character aligned with 
// pat[patlen-1].
// 
// In case 2:
// pat[pos+1] to pat[patlen-1] does occur elsewhere in pat. The
// mismatch tells us that we are not looking at the end of a match.
// We may, however, be looking at the middle of a match.
// 
// The first loop, which takes care of case 1, is analogous to
// the KMP table, adapted for a 'backwards' scan order with the
// additional restriction that the substrings it considers as 
// potential prefixes are all suffixes. In the worst case scenario
// pat consists of the same letter repeated, so every suffix is
// a prefix. This loop alone is not sufficient, however:
// Suppose that pat is "ABYXCDEYX", and text is ".....ABYXCDEYX".
// We will match X, Y, and find B != E. There is no prefix of pat
// in the suffix "YX", so the first loop tells us to skip forward
// by 9 characters.
// Although superficially similar to the KMP table, the KMP table
// relies on information about the beginning of the partial match
// that the BM algorithm does not have.
//
// The second loop addresses case 2. Since suffix_length may not be
// unique, we want to take the minimum value, which will tell us
// how far away the closest potential match is.
void make_delta2(int *delta2, uint8_t *pat, int32_t patlen) {
    int p;
    int last_prefix_index = patlen-1;
 
    // first loop
    for (p=patlen-1; p>=0; p--) {
        if (is_prefix(pat, patlen, p+1)) {
            last_prefix_index = p+1;
        }
        delta2[p] = last_prefix_index + (patlen-1 - p);
    }
 
    // second loop
    for (p=0; p < patlen-1; p++) {
        int slen = suffix_length(pat, patlen, p);
        if (pat[p - slen] != pat[patlen-1 - slen]) {
            delta2[patlen-1 - slen] = patlen-1 - p + slen;
        }
    }
}
 
uint8_t* boyer_moore (uint8_t *string, uint32_t stringlen, uint8_t *pat, uint32_t patlen) {
    int i;
    int delta1[ALPHABET_LEN];
    int *delta2 = malloc(patlen * sizeof(int));
    make_delta1(delta1, pat, patlen);
    make_delta2(delta2, pat, patlen);
 
    i = patlen-1;
    while (i < stringlen) {
        int j = patlen-1;
        while (j >= 0 && (string[i] == pat[j])) {
            --i;
            --j;
        }
        if (j < 0) {
            free(delta2);
            printf("index: %d\n", i);
            return (string + i+1);
        }
 
        i += max(delta1[string[i]], delta2[j]);
    }
    free(delta2);
    return NULL;
}

int main(int argc, char *argv[])
{
    
    char text[] = "HERE IS A SILMPLE EXAMPLE";
    char pattern[] = "EXAMPLE";
    char *result = boyer_moore(text, strlen(text), pattern, strlen(pattern));
    if (result)
    {
        printf("result: %s\n", result);
    } else {
        printf("failed!\n");
    }
    return 0;
}

2013-07-13

数据结构和算法

字符串匹配算法（三）——KMP算法

KMP算法是常用的字符串匹配算法之一，是由Knuth、Pratt和Morris发明的。其中Knuth就是著名科学家Donald Knuth。

该算法可以在O(n+m)的时间数量级上完成模式匹配操作。与传统算法相比其改进在于每当一趟匹配过程中出现比较不相等时不用回溯目标字符串。

这个算法还是比较难理解，下面通过一个例子来讲解。
有一个目标字符串”BBC ABCDAB ABCDABCDABDE”，我想知道，里面是否包含另一个字符串”ABCDABD”？

1).

BBC ABCDAB ABCDABCDABDE
ABCDABD

首先目标字符串”BBC ABCDAB ABCDABCDABDE”的第一个字符与源字符串”ABCDABD”的第一个字符相比，不匹配，则目标字符串后移一位。 2).

BBC ABCDAB ABCDABCDABDE
 ABCDABD

B与A不匹配，再次后移。 3).

BBC ABCDAB ABCDABCDABDE
    ABCDABD

就这样直到有一个字符相同为止。 4).

BBC ABCDAB ABCDABCDABDE
    ABCDABD

接着比较下一个字符，还是相同的。再继续。 5).

BBC ABCDAB ABCDABCDABDE
    ABCDABD

这时又出现了字符不匹配的情况，如果是传统的算法的话就把目标字符串进行回溯，源字符串后移一位。变成如下情形：

BBC ABCDAB ABCDABCDABDE
     ABCDABD

毕竟这也是最容易理解的，但是这样一来很多字符又需要再重新比较一次。因此效率比较低。 6). 虽然D与空格不匹配，但是前面的ABCDAB是相匹配的。KMP算法就是利用这个已知信息，不要把“搜索”位置称回已经比较过的位置，而且继续比较之后的，这样就减少了重复的比较，提高了效率。继续向后移动变成如下情形：

BBC ABCDAB ABCDABCDABDE
        ABCDABD

这个移动过程需要一张部分匹配表：

j        1234567
源字符串 ABCDABD
next[j]  0111123

因为前6个字符是匹配的，最后一个匹配的字符B对应next值为2,根据公式：移动位数=已匹配字符数-对应的next值因此将源字符串后移6-2=4位。 7).因为空格与C不相同，继续后移。

BBC ABCDAB ABCDABCDABDE
          ABCDABD

8).空格与A不相同，继续重复之前的操作进行后移。

BBC ABCDAB ABCDABCDABDE
           ABCDABD

9).这里C与D不相同，继续用之前的方法进行后移。

BBC ABCDAB ABCDABCDABDE
               ABCDABD

最后匹配完成，字符串”BBC ABCDAB ABCDABCDABDE”里面包含字符串”ABCDABD”。下面介绍一下部分匹配表的计算方法。公式为： next[j] = 0 (j=1) next[j] = Max {k 1<k<j 且 'p(1)...p(k-1)' = 'p(j-k+1)...p(j-1)'} (当此集合不为空时) next[j] = 1 (其他情况) 然后根据公式讲解一下上面例子中的字符串的部分匹配表的计算。(注：这里讲解时是按照数组下标从1开始的，而在C语言中是从0开始的)

j        1234567
源字符串 ABCDABD

由公式得next[1] = 0， next[2] = 1， next[3] = 1， next[4] = 1， next[5] = 1。
当j=6时，p[1]=p5 => k=2 => next[6] = 2。
当j=7时，p[1:2]=p5:6 => k=3 => next[7] = 3。

用C语言实现如下：

#include <stdio.h>

// KMP算法的重点就是求next函数值
void make_next(const char *pattern, int len, unsigned int *next)
{
    int i=1, j=0;
    next[1] = 0;

    while (i <= len) {
        if (j == 0 || pattern[i-1] == pattern[j-1]) {
            i++;
            j++;
            next[i] = j;
        } else {
            j = next[j];
        }
    }
}

int kmp_match(char *t, char *p, int next[], int len1, int len2)
{
    int i=0, j=0;
    while (i < len1 && j < len2) {
        if (j == 0 || t[i] == p[j]) {
            i++;
            j++;
        } else {
            j = next[j];
        }
    }
    if (j >= len2 ) return i - len2;
    return -1;
}

int main(int argc, char *argv[])
{
    char t[] = "BBC ABCDAB ABCDABCDABDE";
    char p[] = "ABCDABD";
    printf("t: %s\np: %s\n", t, p);
    int len = sizeof(p);
    int next[len];
    make_next(p, len-1, next);
    int i;
    for (i=1; i<len; i++) {
        printf("%4d", next[i]);
    }
    printf("\n");
    printf("index: %d\n", kmp_match(t, p, next, sizeof(t)-1, len-1));
    return 0;
}

2013-07-13

数据结构和算法

Python中的排序算法——Timsort

突然对Python中list对象的sort函数比较好奇，想知道它用的是什么算法。正好最近我也在复习算法，于是就随便把Python的sort也研究一下吧。刚开始我猜想可能用是快速排序之类的，最后看了 Python的源代码之后发现其实用的是一个适应性强的、稳定的、自然的归并算法，名为timsort。

Timsort是一种混合的算法，它派生自归并排序和插入排序。它是2002年由Tim Peters为Python语言发明的。
Timsort的时间复杂度最好情况为O(n)，平均情况为O(nlogn)，最差情况为O(nlogn)；空间复杂度为O(n)。
更多关于Timsort的信息： http://en.wikipedia.org/wiki/Timsort

在《Python Cookbook(第2版)中文版》这本书的第5章中由Tim Peters介绍了Python排序算法发展简史。其中说到Timsort算法的实现包括了大约1200行C程序，而且大概有一半是在实现一个技术上的技巧。可见该算法还是很复杂的。

在Python源代码目录下有Objects/listsort.txt这个文件，这是Tim Peters写的一个详细的文档。

有个项目致力于用纯C来实现Timsort算法，如Tim所言代码量超过了1200行。http://code.google.com/p/timsort/