Android (Java): Simulating a Zhihu Login and Scraping User Information
Published: 2019-05-25


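Not long ago I read an article whose login approach was to copy the cookie straight into the code. My goal here is not scraping as such; I only want to walk through the basic process of simulating a login in Java. An article I wrote earlier already falls within the scope of simulated login. I have also recently seen many people on Zhihu asking how 超级课程表 (Super Course Schedule) is implemented, and at its core that is simulated login too. Once you have worked through this article, fetching data that sits behind a login should be much less of a worry. This article builds on my earlier article about cookie persistence and on Jsoup. To keep things simple it uses plain Java SE rather than Android. Porting it to Android should mostly be a matter of moving the network requests onto a background thread.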

First, open Zhihu in Chrome and click Log in. You will see the following screen:
[Screenshot: login screen]

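In Chrome, press F12 to open the developer tools, switch to the Network tab, and tick Preserve log. Do not skip this step: without it the log is cleared when the page navigates after login, and you will not see the request.

[Screenshot: Network tab with Preserve log ticked]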

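With everything ready, enter your account and password in the input boxes and click Log in. After a successful login you will see a record like this:

[Screenshot: the login request in the Network list]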

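Click the entry named email in the screenshot. At the bottom you will see that this request submitted four parameters, and at the top you will see the request URL.

[Screenshot: request URL]

[Screenshot: submitted form parameters]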

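You may be surprised to find that Zhihu transmits the password in plain text. The submitted parameters are easy to read: email is the account, password is the password, and remember_me says whether to stay logged in (passing true is fine). There is also an _xsrf parameter. My rough guess is that it is an anti-crawler measure; judging by its name, it is a cross-site request forgery token. Either way, we have to scrape its value from the page source before submitting. The value sits in a hidden field of the form:

[Screenshot: hidden _xsrf field in the page source]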
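
The article pulls this value out with Jsoup later on. As a dependency-free illustration of the same step, here is a minimal sketch that uses a regular expression instead of Jsoup. It assumes the `name` attribute appears before `value` inside the tag; the class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XsrfExtractSketch {
    // Assumption: the tag looks like <input ... name="_xsrf" ... value="...">,
    // with name before value. A real HTML parser such as Jsoup is more robust.
    private static final Pattern XSRF = Pattern.compile(
            "<input[^>]*name=\"_xsrf\"[^>]*value=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns the hidden _xsrf value, or null when the page has no such field. */
    static String extractXsrf(String html) {
        Matcher m = XSRF.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

If the field is missing the method returns null, which is a useful signal that the page layout has changed.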

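With all that prepared, you happily run your code to simulate the login, only to get back a captcha error. It turns out a captcha must be submitted as well, under the parameter name captcha. The captcha image is served at: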

http://www.zhihu.com/captcha.gif?r=时间戳

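That gives us the following request data.

  • Request URL
      • http://www.zhihu.com/login/email
  • Request parameters
      • _xsrf: the value of the hidden field extracted from the form
      • captcha: the captcha text
      • email: the email address
      • password: the password
      • remember_me: whether to stay logged in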
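
To make the request data above concrete, here is a minimal sketch of how those five parameters end up as an `application/x-www-form-urlencoded` body. It uses only the JDK; OkHttp's form builder, used later in the article, does this encoding for you. The class name and the sample values are assumptions for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class LoginFormSketch {
    /** Collects the five login parameters in the order listed above. */
    static Map<String, String> loginParams(String xsrf, String captcha,
                                           String email, String password) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("_xsrf", xsrf);
        params.put("captcha", captcha);
        params.put("email", email);
        params.put("password", password);
        params.put("remember_me", "true");
        return params;
    }

    /** Joins the parameters as key=value pairs separated by '&', URL-encoding both sides. */
    static String encode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM
            throw new IllegalStateException(e);
        }
    }
}
```

Note that characters such as `@` in the email address are percent-encoded, which is why the raw body looks slightly different from what the developer tools show in their parsed view.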

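One question remains: how do we obtain the captcha value? The answer is manual input: save the captcha image locally, read it by eye, type it in, and then log in.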
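
The manual step can be as simple as a console prompt. The following is a minimal sketch, assuming the captcha image has already been saved to disk and that the program runs in a terminal; the class and method names are hypothetical.

```java
import java.io.InputStream;
import java.util.Scanner;

final class CaptchaPromptSketch {
    /** Reads one line from the given stream and trims it; returns "" if nothing was typed. */
    static String readCaptcha(InputStream in) {
        Scanner scanner = new Scanner(in, "UTF-8");
        return scanner.hasNextLine() ? scanner.nextLine().trim() : "";
    }

    public static void main(String[] args) {
        System.out.print("Open the saved captcha image and type what you see: ");
        String captcha = readCaptcha(System.in);
        System.out.println("captcha = " + captcha);
    }
}
```

Taking an `InputStream` rather than reading `System.in` directly keeps the helper easy to exercise without a terminal.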

Network requests here use OkHttp and parsing uses Jsoup; we will also use Gson. Add all three as Maven dependencies:

    <dependencies>
        <dependency>
            <groupId>com.squareup.okhttp</groupId>
            <artifactId>okhttp</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.8.3</version>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.3.1</version>
        </dependency>
    </dependencies>

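Before coding, we need to think about how to keep the login state, that is, how to persist cookies. We want to log in only once and then collect data directly, so the cookies must be persisted. I adapted an Android class from the earlier article so that it works on the Java platform: instead of saving to SharedPreferences it now saves to a file in JSON form, which is why Gson is needed.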

    package cn.edu.zafu.zhihu;

    import com.google.gson.Gson;
    import com.google.gson.GsonBuilder;
    import com.google.gson.reflect.TypeToken;

    import java.io.*;
    import java.net.CookieStore;
    import java.net.HttpCookie;
    import java.net.URI;
    import java.net.URISyntaxException;
    import java.util.*;
    import java.util.concurrent.ConcurrentHashMap;

    /**
     * User:lizhangqu(513163535@qq.com)
     * Date:2015-07-18
     * Time: 16:54
     *
     * Note: SerializableHttpCookie comes from the earlier cookie article and is not listed here.
     */
    public class PersistentCookieStore implements CookieStore {
        private static final Gson gson = new GsonBuilder().setPrettyPrinting().create();
        private static final String LOG_TAG = "PersistentCookieStore";
        private static final String COOKIE_PREFS = "CookiePrefsFile";
        private static final String COOKIE_NAME_PREFIX = "cookie_";

        // host -> (cookie token -> cookie)
        private final HashMap<String, ConcurrentHashMap<String, HttpCookie>> cookies;
        // Flat map written to cookie.json: host -> "token1,token2" and "cookie_" + token -> hex-encoded cookie
        private Map<String, String> cookiePrefs = new HashMap<String, String>();

        /**
         * Construct a persistent cookie store.
         */
        public PersistentCookieStore() {
            String cookieJson = readFile("cookie.json");
            Map<String, String> fromJson = gson.fromJson(cookieJson, new TypeToken<Map<String, String>>() {}.getType());
            if (fromJson != null) {
                System.out.println(fromJson);
                cookiePrefs = fromJson;
            }
            cookies = new HashMap<String, ConcurrentHashMap<String, HttpCookie>>();

            // Load any previously stored cookies into the store
            for (Map.Entry<String, String> entry : cookiePrefs.entrySet()) {
                if (entry.getValue() != null && !entry.getValue().startsWith(COOKIE_NAME_PREFIX)) {
                    String[] cookieNames = split(entry.getValue(), ",");
                    for (String name : cookieNames) {
                        String encodedCookie = cookiePrefs.get(COOKIE_NAME_PREFIX + name);
                        if (encodedCookie != null) {
                            HttpCookie decodedCookie = decodeCookie(encodedCookie);
                            if (decodedCookie != null) {
                                if (!cookies.containsKey(entry.getKey()))
                                    cookies.put(entry.getKey(), new ConcurrentHashMap<String, HttpCookie>());
                                cookies.get(entry.getKey()).put(name, decodedCookie);
                            }
                        }
                    }
                }
            }
        }

        public void add(URI uri, HttpCookie cookie) {
            String name = getCookieToken(uri, cookie);

            // Save cookie into local store, or remove if expired
            if (!cookie.hasExpired()) {
                if (!cookies.containsKey(uri.getHost()))
                    cookies.put(uri.getHost(), new ConcurrentHashMap<String, HttpCookie>());
                cookies.get(uri.getHost()).put(name, cookie);
            } else {
                // Fixed: the original tested uri.toString(), which never matches a host key
                if (cookies.containsKey(uri.getHost()))
                    cookies.get(uri.getHost()).remove(name);
            }

            // Fixed: guard against a host with no stored cookies (NullPointerException in the original)
            if (cookies.containsKey(uri.getHost())) {
                cookiePrefs.put(uri.getHost(), join(",", cookies.get(uri.getHost()).keySet()));
            }
            cookiePrefs.put(COOKIE_NAME_PREFIX + name, encodeCookie(new SerializableHttpCookie(cookie)));

            // Persist the whole map as JSON after every change
            String json = gson.toJson(cookiePrefs);
            saveFile(json.getBytes(), "cookie.json");
        }

        protected String getCookieToken(URI uri, HttpCookie cookie) {
            return cookie.getName() + cookie.getDomain();
        }

        public List<HttpCookie> get(URI uri) {
            ArrayList<HttpCookie> ret = new ArrayList<HttpCookie>();
            if (cookies.containsKey(uri.getHost()))
                ret.addAll(cookies.get(uri.getHost()).values());
            return ret;
        }

        public boolean removeAll() {
            cookiePrefs.clear();
            cookies.clear();
            return true;
        }

        public boolean remove(URI uri, HttpCookie cookie) {
            String name = getCookieToken(uri, cookie);

            if (cookies.containsKey(uri.getHost()) && cookies.get(uri.getHost()).containsKey(name)) {
                cookies.get(uri.getHost()).remove(name);
                if (cookiePrefs.containsKey(COOKIE_NAME_PREFIX + name)) {
                    cookiePrefs.remove(COOKIE_NAME_PREFIX + name);
                }
                cookiePrefs.put(uri.getHost(), join(",", cookies.get(uri.getHost()).keySet()));
                return true;
            } else {
                return false;
            }
        }

        public List<HttpCookie> getCookies() {
            ArrayList<HttpCookie> ret = new ArrayList<HttpCookie>();
            for (String key : cookies.keySet())
                ret.addAll(cookies.get(key).values());
            return ret;
        }

        public List<URI> getURIs() {
            ArrayList<URI> ret = new ArrayList<URI>();
            for (String key : cookies.keySet())
                try {
                    ret.add(new URI(key));
                } catch (URISyntaxException e) {
                    e.printStackTrace();
                }
            return ret;
        }

        /**
         * Serializes Cookie object into String
         *
         * @param cookie cookie to be encoded, can be null
         * @return cookie encoded as String
         */
        protected String encodeCookie(SerializableHttpCookie cookie) {
            if (cookie == null)
                return null;
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            try {
                ObjectOutputStream outputStream = new ObjectOutputStream(os);
                outputStream.writeObject(cookie);
            } catch (IOException e) {
                System.out.println("IOException in encodeCookie" + e);
                return null;
            }
            return byteArrayToHexString(os.toByteArray());
        }

        /**
         * Returns cookie decoded from cookie string
         *
         * @param cookieString string of cookie as returned from http request
         * @return decoded cookie or null if exception occurred
         */
        protected HttpCookie decodeCookie(String cookieString) {
            byte[] bytes = hexStringToByteArray(cookieString);
            ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
            HttpCookie cookie = null;
            try {
                ObjectInputStream objectInputStream = new ObjectInputStream(byteArrayInputStream);
                cookie = ((SerializableHttpCookie) objectInputStream.readObject()).getCookie();
            } catch (IOException e) {
                System.out.println("IOException in decodeCookie" + e);
            } catch (ClassNotFoundException e) {
                System.out.println("ClassNotFoundException in decodeCookie" + e);
            }
            return cookie;
        }

        /**
         * Using some super basic byte array <-> hex conversions so we don't have to rely on any
         * large Base64 libraries. Can be overridden if you like!
         *
         * @param bytes byte array to be converted
         * @return string containing hex values
         */
        protected String byteArrayToHexString(byte[] bytes) {
            StringBuilder sb = new StringBuilder(bytes.length * 2);
            for (byte element : bytes) {
                int v = element & 0xff;
                if (v < 16) {
                    sb.append('0');
                }
                sb.append(Integer.toHexString(v));
            }
            return sb.toString().toUpperCase(Locale.US);
        }

        /**
         * Converts hex values from strings to byte array
         *
         * @param hexString string of hex-encoded values
         * @return decoded byte array
         */
        protected byte[] hexStringToByteArray(String hexString) {
            int len = hexString.length();
            byte[] data = new byte[len / 2];
            for (int i = 0; i < len; i += 2) {
                data[i / 2] = (byte) ((Character.digit(hexString.charAt(i), 16) << 4)
                        + Character.digit(hexString.charAt(i + 1), 16));
            }
            return data;
        }

        public static String join(CharSequence delimiter, Iterable<?> tokens) {
            StringBuilder sb = new StringBuilder();
            boolean firstTime = true;
            for (Object token : tokens) {
                if (firstTime) {
                    firstTime = false;
                } else {
                    sb.append(delimiter);
                }
                sb.append(token);
            }
            return sb.toString();
        }

        public static String[] split(String text, String expression) {
            if (text.length() == 0) {
                return new String[]{};
            } else {
                return text.split(expression, -1);
            }
        }

        public static void saveFile(byte[] bfile, String fileName) {
            BufferedOutputStream bos = null;
            FileOutputStream fos = null;
            File file = null;
            try {
                file = new File(fileName);
                fos = new FileOutputStream(file);
                bos = new BufferedOutputStream(fos);
                bos.write(bfile);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (bos != null) {
                    try {
                        bos.close();
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                }
                if (fos != null) {
                    try {
                        fos.close();
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                }
            }
        }

        public static String readFile(String fileName) {
            BufferedInputStream bis = null;
            FileInputStream fis = null;
            File file = null;
            try {
                file = new File(fileName);
                // On the first run there is no cookie file yet; return empty instead of logging a stack trace
                if (!file.exists()) {
                    return "";
                }
                fis = new FileInputStream(file);
                bis = new BufferedInputStream(fis);
                int available = bis.available();
                byte[] bytes = new byte[available];
                bis.read(bytes);
                String str = new String(bytes);
                return str;
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (bis != null) {
                    try {
                        bis.close();
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                }
                if (fis != null) {
                    try {
                        fis.close();
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                }
            }
            return "";
        }
    }
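
PersistentCookieStore depends on a `SerializableHttpCookie` class that this article does not list (it comes from the earlier cookie article). `java.net.HttpCookie` itself is not `Serializable`, so some wrapper is needed. The following is a minimal sketch of what such a wrapper could look like, not the author's original class: it stores only name, value, domain, path, max age, and the secure flag, and the `SerializableHttpCookieDemo` helper exists purely to show a round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.HttpCookie;

// Assumed reconstruction: package-private here; make it public in its own file if needed.
class SerializableHttpCookie implements Serializable {
    private static final long serialVersionUID = 1L;

    // HttpCookie is not Serializable, so it is written field by field below.
    private transient HttpCookie cookie;

    SerializableHttpCookie(HttpCookie cookie) {
        this.cookie = cookie;
    }

    HttpCookie getCookie() {
        return cookie;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.writeObject(cookie.getName());
        out.writeObject(cookie.getValue());
        out.writeObject(cookie.getDomain());
        out.writeObject(cookie.getPath());
        out.writeLong(cookie.getMaxAge());
        out.writeBoolean(cookie.getSecure());
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        String name = (String) in.readObject();
        String value = (String) in.readObject();
        // The creation time restarts here, so max-age counts from deserialization.
        cookie = new HttpCookie(name, value);
        cookie.setDomain((String) in.readObject());
        cookie.setPath((String) in.readObject());
        cookie.setMaxAge(in.readLong());
        cookie.setSecure(in.readBoolean());
    }
}

final class SerializableHttpCookieDemo {
    /** Serializes the cookie to bytes and reads it back, as the cookie store does via hex strings. */
    static HttpCookie roundTrip(HttpCookie original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(new SerializableHttpCookie(original));
            out.close();
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            return ((SerializableHttpCookie) in.readObject()).getCookie();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

One limitation of this sketch: because the cookie is rebuilt on load, its age is measured from the moment it is read back rather than from when the server first set it.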

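Next, create an OkHttp client and set its cookie handler to the class we just wrote.

    import com.squareup.okhttp.OkHttpClient;
    import java.net.CookieManager;
    import java.net.CookiePolicy;

    private static OkHttpClient client = new OkHttpClient();

    static {
        // A statement cannot sit at class level, so the handler is installed in a static initializer.
        // Every cookie now goes through the file-backed store.
        client.setCookieHandler(new CookieManager(new PersistentCookieStore(), CookiePolicy.ACCEPT_ALL));
    }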
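
The cookie handling above relies on `java.net.CookieManager`: it stores cookies found in `Set-Cookie` response headers into its `CookieStore` and replays them as a `Cookie` header on later requests to the same site. The sketch below shows that round trip using only the JDK, with the default in-memory store standing in for PersistentCookieStore. The cookie name and value are made up for illustration.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CookieManagerSketch {
    /** Feeds one Set-Cookie header in, then asks which Cookie header the next request would carry. */
    static String cookieHeaderAfterResponse() {
        // Passing null selects the JDK's in-memory store; the article passes PersistentCookieStore instead.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        URI uri = URI.create("http://www.zhihu.com/");
        try {
            // Simulate a response that sets a (hypothetical) session cookie.
            Map<String, List<String>> responseHeaders = new HashMap<String, List<String>>();
            responseHeaders.put("Set-Cookie", Collections.singletonList("session=abc123; Path=/"));
            manager.put(uri, responseHeaders);

            // Ask the manager for the headers to attach to the next request.
            Map<String, List<String>> requestHeaders =
                    manager.get(uri, new HashMap<String, List<String>>());
            List<String> cookie = requestHeaders.get("Cookie");
            return (cookie == null || cookie.isEmpty()) ? "" : cookie.get(0);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Swapping the in-memory store for a file-backed one changes where the cookies live, not how the manager uses them, which is why logging in once is enough.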

Now we can fetch the _xsrf value and the captcha. The captcha is saved to a file named code.png in the project root.

    import com.squareup.okhttp.Request;
    import com.squareup.okhttp.Response;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;

    private static String xsrf;

    public static void getCode() throws IOException {
        // Load the home page and read the first hidden input, which holds _xsrf
        Request request = new Request.Builder()
                .url("http://www.zhihu.com/")
                .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
                .build();
        Response response = client.newCall(request).execute();
        String result = response.body().string();
        Document parse = Jsoup.parse(result);
        System.out.println(parse + "");
        result = parse.select("input[type=hidden]").get(0).attr("value").trim();
        xsrf = result;
        System.out.println("_xsrf:" + result);

        // Download the captcha image; the r parameter is the current timestamp
        String codeUrl = "http://www.zhihu.com/captcha.gif?r=";
        codeUrl += System.currentTimeMillis();
        System.out.println("codeUrl:" + codeUrl);
        Request getcode = new Request.Builder()
                .url(codeUrl)
                .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
                .build();
        Response code = client.newCall(getcode).execute();
        byte[] bytes = code.body().bytes();
        saveCode(bytes, "code.png");
    }

    public static void saveCode(byte[] bfile, String fileName) {
        BufferedOutputStream bos = null;
        FileOutputStream fos = null;
        File file = null;
        try {
            file = new File(fileName);
            fos = new FileOutputStream(file);
            bos = new BufferedOutputStream(fos);
            bos.write(bfile);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (bos != null) {
                try {
                    bos.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
            if (fos != null) {
                try {
                    fos.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
        }
    }

Then submit the fetched parameters together with the account and password to log in. The response escapes Chinese text as Unicode escape sequences, so a small decode helper turns it back into readable characters before printing.

    import com.squareup.okhttp.FormEncodingBuilder;
    import com.squareup.okhttp.Request;
    import com.squareup.okhttp.RequestBody;
    import com.squareup.okhttp.Response;
    import java.io.IOException;

    public static void login(String randCode, String email, String password) throws IOException {
        RequestBody formBody = new FormEncodingBuilder()
                .add("_xsrf", xsrf)
                .add("captcha", randCode)
                .add("email", email)
                .add("password", password)
                .add("remember_me", "true")
                .build();
        Request login = new Request.Builder()
                .url("http://www.zhihu.com/login/email")
                .post(formBody)
                .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
                .build();
        Response execute = client.newCall(login).execute();
        System.out.println(decode(execute.body().string()));
    }

    // Replaces each backslash-u escape followed by four hex digits with the character it encodes
    public static String decode(String unicodeStr) {
        if (unicodeStr == null) {
            return null;
        }
        StringBuffer retBuf = new StringBuffer();
        int maxLoop = unicodeStr.length();
        for (int i = 0; i < maxLoop; i++) {
            if (unicodeStr.charAt(i) == '\\') {
                if ((i < maxLoop - 5)
                        && ((unicodeStr.charAt(i + 1) == 'u') || (unicodeStr.charAt(i + 1) == 'U')))
                    try {
                        retBuf.append((char) Integer.parseInt(unicodeStr.substring(i + 2, i + 6), 16));
                        i += 5;
                    } catch (NumberFormatException localNumberFormatException) {
                        // Not a valid escape: keep the backslash as-is
                        retBuf.append(unicodeStr.charAt(i));
                    }
                else
                    retBuf.append(unicodeStr.charAt(i));
            } else {
                retBuf.append(unicodeStr.charAt(i));
            }
        }
        return retBuf.toString();
    }
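
To see what the decode helper above is doing, here is a compact, self-contained variant of the same idea written with a regular expression. It is an alternative sketch, not the article's method, and the sample JSON string in the usage note is invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeEscapeSketch {
    // A literal backslash, then u or U, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\[uU]([0-9a-fA-F]{4})");

    /** Replaces every escape sequence with the character it encodes; other text is copied unchanged. */
    static String decode(String s) {
        if (s == null) {
            return null;
        }
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps characters such as '$' from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, passing a body such as `{"msg": "\u767b\u9646\u6210\u529f"}` (with the escapes present as literal text in the response) yields the same JSON with the four escapes replaced by the characters they encode.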

When you see the following message, the login has succeeded:

[Screenshot: login success response]

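After that you can fetch whatever information you want. Here is a simple example: collecting nicknames from the follow list of 轮子哥 (vczh). Note that the URL in the code is his followees page, that is, the people he follows. Pagination is left for you to handle.

    import com.squareup.okhttp.Request;
    import com.squareup.okhttp.Response;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    import java.io.IOException;

    public static void getFollowers() throws IOException {
        Request request = new Request.Builder()
                .url("http://www.zhihu.com/people/zord-vczh/followees")
                .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
                .build();
        Response response = client.newCall(request).execute();
        String result = response.body().string();
        Document parse = Jsoup.parse(result);
        Elements select = parse.select("div.zm-profile-card");
        StringBuilder builder = new StringBuilder();
        // NOTE: the original listing is cut off right after "for (int i=0;i".
        // The loop body below is an ASSUMED reconstruction: it collects the text of each profile card.
        for (int i = 0; i < select.size(); i++) {
            builder.append(select.get(i).text()).append("\n");
        }
        System.out.println(builder);
    }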

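The figure below shows the information retrieved. Once you are logged in, you can fetch any information that your account is able to view.
[Screenshot: fetched nicknames]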

Finally, the source code is provided as an IntelliJ Maven project.
