Yuhang Zheng

RTL8211F: Link Negotiates to Gigabit When a 4-Wire Cable Connects to a Gigabit Device

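Background:

We recently received customer feedback that on the OK3588 platform, when a 100 Mb/s (4-wire) cable connects the board to a gigabit device, the link does not fall back to 100 Mb/s. It is still negotiated as gigabit, and the network is unreachable. After forcing the port to 100 Mb/s with ethtool, the network works.

However, when the same 100 Mb/s cable connects the board to a 100 Mb/s device, the link negotiates to 100 Mb/s correctly.

Approach:

After receiving the report, the first step was to search online for similar issues, which turned up this article:

Reference article: https://blog.csdn.net/Emo_snaf/article/details/120762203

The fix suggested by that article is to modify genphy_read_lpa(), the function that reads the link partner's abilities. It adds a check of bit 9 of GBCR (1000Base-T Control Register, Address 0x09), that is, the ADVERTISE_1000FULL bit of the MII_CTRL1000 register, to see whether the local PHY is advertising 1000Base-T Full-Duplex. If it is not, the link partner's 1000Base-T Full-Duplex ability that was read back is cleared, so that the partner appears not to support gigabit. The code that runs afterwards then resolves the speed to 100 Mb/s.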

Source location: ./include/uapi/linux/mii.h

Local advertisement register: MII_CTRL1000, 0x09

Link partner ability register: MII_STAT1000, 0x0a

1000Base-T Full-Duplex ability bit: ADVERTISE_1000FULL, 0x0200 (bit 9)

Link partner 1000BASE-T Full-Duplex ability bit: LPA_1000FULL, 0x0800 (bit 11)
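
The masking idea can be tried out in user space before touching the kernel. The following is a minimal sketch, not driver code: mask_lpa_1000full() is a hypothetical helper that takes the raw values of registers 0x09 and 0x0a as plain integers, and only the two bit definitions listed above come from mii.h.

```c
#include <stdint.h>

/* Bit definitions as listed above (include/uapi/linux/mii.h). */
#define ADVERTISE_1000FULL 0x0200 /* MII_CTRL1000 (0x09), bit 9  */
#define LPA_1000FULL       0x0800 /* MII_STAT1000 (0x0a), bit 11 */

/*
 * Hypothetical helper for illustration only.
 * ctrl1000: raw value of register 0x09 (local advertisement)
 * stat1000: raw value of register 0x0a (link partner abilities)
 * Returns stat1000 with the 1000BASE-T full-duplex bit cleared
 * whenever the local PHY is not advertising that mode.
 */
static uint16_t mask_lpa_1000full(uint16_t ctrl1000, uint16_t stat1000)
{
	if (!(ctrl1000 & ADVERTISE_1000FULL))
		stat1000 &= (uint16_t)~LPA_1000FULL;
	return stat1000;
}
```

With bit 9 of register 0x09 cleared, mask_lpa_1000full(0x0000, 0x3800) returns 0x3000; with bit 9 set, mask_lpa_1000full(0x0200, 0x3800) returns 0x3800 unchanged.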

Initial fix:

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index e83428c92b33..d4eb8f2253d3 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -2307,6 +2307,9 @@ int genphy_read_lpa(struct phy_device *phydev)
                        if (lpagb < 0)
                                return lpagb;

+                       if (!(phy_read(phydev, MII_CTRL1000) & ADVERTISE_1000FULL))
+                               lpagb = lpagb & ~LPA_1000FULL;
+
                        if (lpagb & LPA_1000MSFAIL) {
                                int adv = phy_read(phydev, MII_CTRL1000);

With this change, ethtool reports the following for the interface:

Settings for eth0:
        Supported ports: [ TP    MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: Symmetric Receive-only
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                             100baseT/Half 100baseT/Full
        Link partner advertised pause frame use: Symmetric Receive-only
        Link partner advertised auto-negotiation: Yes
        Link partner advertised FEC modes: Not reported
        Speed: 100Mb/s
        Duplex: Full
        Auto-negotiation: on
        master-slave cfg: preferred slave
        master-slave status: slave
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: external
        MDI-X: Unknown
        Supports Wake-on: ug
        Wake-on: d
        Current message level: 0x0000003f (63)
                               drv probe link timer ifdown ifup
        Link detected: yes

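The Link partner advertised link modes no longer include 1000baseT/Full, and Speed: 100Mb/s is now correct.

As far as the reported problem goes, this solves it, and it could be regarded as a bug in the kernel PHY driver. Still, it seemed worth digging deeper.

Further investigation:

Later testing showed that when a 100 Mb/s cable connects the board to a gigabit device, the OK3568 running Linux 5.10 has the same problem: the link is reported as gigabit and the network is unreachable. On Linux 4.19 the problem does not occur.

On Linux 5.10, ethtool reports: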

Settings for eth0:
        Supported ports: [ TP    MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: Symmetric Receive-only
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                             100baseT/Half 100baseT/Full
                                             1000baseT/Full
        Link partner advertised pause frame use: Symmetric Receive-only
        Link partner advertised auto-negotiation: Yes
        Link partner advertised FEC modes: Not reported
        Speed: 1000Mb/s
        Duplex: Full
        Auto-negotiation: on
        master-slave cfg: preferred slave
        master-slave status: slave
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: external
        MDI-X: Unknown
        Supports Wake-on: ug
        Wake-on: d
        Current message level: 0x0000003f (63)
                               drv probe link timer ifdown ifup
        Link detected: yes

On Linux 4.19, it reports:

Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                             100baseT/Half 100baseT/Full
                                             1000baseT/Full
        Link partner advertised pause frame use: Symmetric Receive-only
        Link partner advertised auto-negotiation: Yes
        Link partner advertised FEC modes: Not reported
        Speed: 100Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: ug
        Wake-on: d
        Current message level: 0x0000003f (63)
                               drv probe link timer ifdown ifup
        Link detected: yes

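One thing stands out: on Linux 4.19, Link partner advertised link modes: also contains 1000baseT/Full, meaning the kernel believes the peer is gigabit-capable. Advertised link modes: contains 1000baseT/Full as well, so it believes the local side is gigabit-capable too. Yet it does not use gigabit speed.

So Linux 4.19 must be adjusting the resolved speed somewhere.

Based on the fix above, the guess was that Linux 4.19 also relies on REG09.BIT9. The main lead is therefore to find where the PHY driver code uses the MII_CTRL1000 register and the ADVERTISE_1000FULL bit.

This turned up the following in genphy_read_status() in ./drivers/net/phy/phy_device.c: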

/**
 * genphy_read_status - check the link status and update current link state
 * @phydev: target phy_device struct
 *
 * Description: Check the link, then figure out the current state
 *   by comparing what we advertise with what the link partner
 *   advertises.  Start by checking the gigabit possibilities,
 *   then move on to 10/100.
 */
int genphy_read_status(struct phy_device *phydev)
{
    int adv;
    int err;
    int lpa;
    int lpagb = 0;
    int common_adv;
    int common_adv_gb = 0;

    /* Update the link, but return if there was an error */
    err = genphy_update_link(phydev);
    if (err)
        return err;

    phydev->lp_advertising = 0;

    if (AUTONEG_ENABLE == phydev->autoneg) {
        if (phydev->supported & (SUPPORTED_1000baseT_Half
                                         | SUPPORTED_1000baseT_Full)) {
            lpagb = phy_read(phydev, MII_STAT1000);
            if (lpagb < 0)
                return lpagb;

            adv = phy_read(phydev, MII_CTRL1000);
            if (adv < 0)
                return adv;

            if (lpagb & LPA_1000MSFAIL) {
                if (adv & CTL1000_ENABLE_MASTER)
                    phydev_err(phydev, "Master/Slave resolution failed, maybe conflicting manual settings?\n");
                else
                    phydev_err(phydev, "Master/Slave resolution failed\n");
                return -ENOLINK;
            }

            phydev->lp_advertising =
                mii_stat1000_to_ethtool_lpa_t(lpagb);
            common_adv_gb = lpagb & adv << 2;
        }

        lpa = phy_read(phydev, MII_LPA);
        if (lpa < 0)
            return lpa;

        phydev->lp_advertising |= mii_lpa_to_ethtool_lpa_t(lpa);

        adv = phy_read(phydev, MII_ADVERTISE);
        if (adv < 0)
            return adv;

        common_adv = lpa & adv;

        phydev->speed = SPEED_10;
        phydev->duplex = DUPLEX_HALF;
        phydev->pause = 0;
        phydev->asym_pause = 0;

        if (common_adv_gb & (LPA_1000FULL | LPA_1000HALF)) {
            phydev->speed = SPEED_1000;

            if (common_adv_gb & LPA_1000FULL)
                phydev->duplex = DUPLEX_FULL;
        } else if (common_adv & (LPA_100FULL | LPA_100HALF)) {
            phydev->speed = SPEED_100;

            if (common_adv & LPA_100FULL)
                phydev->duplex = DUPLEX_FULL;
        } else
            if (common_adv & LPA_10FULL)
                phydev->duplex = DUPLEX_FULL;

        if (phydev->duplex == DUPLEX_FULL) {
            phydev->pause = lpa & LPA_PAUSE_CAP ? 1 : 0;
            phydev->asym_pause = lpa & LPA_PAUSE_ASYM ? 1 : 0;
        }
    } else {
        int bmcr = phy_read(phydev, MII_BMCR);

        if (bmcr < 0)
            return bmcr;

        if (bmcr & BMCR_FULLDPLX)
            phydev->duplex = DUPLEX_FULL;
        else
            phydev->duplex = DUPLEX_HALF;

        if (bmcr & BMCR_SPEED1000)
            phydev->speed = SPEED_1000;
        else if (bmcr & BMCR_SPEED100)
            phydev->speed = SPEED_100;
        else
            phydev->speed = SPEED_10;

        phydev->pause = 0;
        phydev->asym_pause = 0;
    }

    return 0;
}
EXPORT_SYMBOL(genphy_read_status);

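In the code above (line numbers count from the top of the listing, starting at the comment):

Line 29: reads the MII_STAT1000 register to get the link partner's gigabit abilities and stores them in lpagb.

Line 33: reads the MII_CTRL1000 register to get the local gigabit advertisement and stores it in adv.

Line 47: ANDs the two (after shifting adv left by 2 so that the bit positions line up) to find the gigabit abilities common to both sides, stored in common_adv_gb.

Line 50: reads the MII_LPA register to get the link partner's 10/100 abilities and stores them in lpa.

Line 56: reads the MII_ADVERTISE register to get the local 10/100 advertisement and stores it in adv.

Line 60: ANDs the two to find the 10/100 abilities common to both sides, stored in common_adv.

Lines 67-79: determine the final phydev->speed from common_adv_gb and common_adv.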
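
The Linux 4.19 resolution walked through above can be condensed into a small user-space sketch. The function name resolve_speed_4_19() and its four raw register parameters are hypothetical; the bit values are the ones from mii.h. The point to notice is that the local gigabit advertisement comes from the register value that was just read, so if the PHY has cleared bit 9 of register 0x09 (as the fix above suggests it does after falling back), the result drops to 100 Mb/s.

```c
#include <stdint.h>

/* Bit definitions from include/uapi/linux/mii.h. */
#define LPA_1000FULL 0x0800 /* MII_STAT1000 */
#define LPA_1000HALF 0x0400 /* MII_STAT1000 */
#define LPA_100FULL  0x0100 /* MII_LPA */
#define LPA_100HALF  0x0080 /* MII_LPA */

/*
 * Hypothetical stand-in for the speed selection in the 4.19
 * genphy_read_status(); returns the speed in Mb/s.
 */
static int resolve_speed_4_19(uint16_t ctrl1000, uint16_t stat1000,
			      uint16_t advertise, uint16_t lpa)
{
	/* Bits 9/8 of register 0x09 line up with bits 11/10 of 0x0a. */
	unsigned int common_adv_gb = stat1000 & ((unsigned int)ctrl1000 << 2);
	unsigned int common_adv = lpa & advertise;

	if (common_adv_gb & (LPA_1000FULL | LPA_1000HALF))
		return 1000;
	if (common_adv & (LPA_100FULL | LPA_100HALF))
		return 100;
	return 10;
}
```

With illustrative values, resolve_speed_4_19(0x0200, 0x0800, 0x01e1, 0x01e1) returns 1000, while the same call with ctrl1000 set to 0x0000 returns 100. The explicit parentheses spell out what the kernel expression lpagb & adv << 2 already means, since << binds tighter than &.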

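Walking through this function explains how Linux 4.19 handles the case. The Linux 5.10 version of the function, however, no longer contains this logic.

The Linux 5.10 genphy_read_status() is as follows: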

/**
 * genphy_read_status - check the link status and update current link state
 * @phydev: target phy_device struct
 *
 * Description: Check the link, then figure out the current state
 *   by comparing what we advertise with what the link partner
 *   advertises.  Start by checking the gigabit possibilities,
 *   then move on to 10/100.
 */
int genphy_read_status(struct phy_device *phydev)
{
        int err, old_link = phydev->link;

        /* Update the link, but return if there was an error */
        err = genphy_update_link(phydev);
        if (err)
                return err;

        /* why bother the PHY if nothing can have changed */
        if (phydev->autoneg == AUTONEG_ENABLE && old_link && phydev->link)
                return 0;

        phydev->speed = SPEED_UNKNOWN;
        phydev->duplex = DUPLEX_UNKNOWN;
        phydev->pause = 0;
        phydev->asym_pause = 0;

        err = genphy_read_master_slave(phydev);
        if (err < 0)
                return err;

        err = genphy_read_lpa(phydev);
        if (err < 0)
                return err;

        if (phydev->autoneg == AUTONEG_ENABLE && phydev->autoneg_complete) {
                phy_resolve_aneg_linkmode(phydev);
        } else if (phydev->autoneg == AUTONEG_DISABLE) {
                err = genphy_read_status_fixed(phydev);
                if (err < 0)
                        return err;
        }

        return 0;
}
EXPORT_SYMBOL(genphy_read_status);

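This function is much more compact: the operations that Linux 4.19 performed inline have been moved into helper functions.

For example, a genphy_read_master_slave() function was added, and most of the former Linux 4.19 logic now lives in genphy_read_lpa():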

int genphy_read_lpa(struct phy_device *phydev)
{
        int lpa, lpagb;

        if (phydev->autoneg == AUTONEG_ENABLE) {
                if (!phydev->autoneg_complete) {
                        mii_stat1000_mod_linkmode_lpa_t(phydev->lp_advertising,
                                                        0);
                        mii_lpa_mod_linkmode_lpa_t(phydev->lp_advertising, 0);
                        return 0;
                }

                if (phydev->is_gigabit_capable) {
                        lpagb = phy_read(phydev, MII_STAT1000);
                        if (lpagb < 0)
                                return lpagb;

                        if (lpagb & LPA_1000MSFAIL) {
                                int adv = phy_read(phydev, MII_CTRL1000);

                                if (adv < 0)
                                        return adv;

                                if (adv & CTL1000_ENABLE_MASTER)
                                        phydev_err(phydev, "Master/Slave resolution failed, maybe conflicting manual settings?\n");
                                else
                                        phydev_err(phydev, "Master/Slave resolution failed\n");
                                return -ENOLINK;
                        }

                        mii_stat1000_mod_linkmode_lpa_t(phydev->lp_advertising,
                                                        lpagb);
                }

                lpa = phy_read(phydev, MII_LPA);
                if (lpa < 0)
                        return lpa;

                mii_lpa_mod_linkmode_lpa_t(phydev->lp_advertising, lpa);
        } else {
                linkmode_zero(phydev->lp_advertising);
        }

        return 0;
}
EXPORT_SYMBOL(genphy_read_lpa);

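In this function (line numbers count from the function signature):

Line 14: reads the MII_STAT1000 register to get the link partner's gigabit abilities and stores them in lpagb.

Line 35: reads the MII_LPA register to get the link partner's 10/100 abilities and stores them in lpa.

Both lpagb and lpa are then merged into phydev->lp_advertising.

However, nothing is done with the local gigabit advertisement. MII_CTRL1000 is read only to diagnose a Master/Slave resolution failure.

Next comes phy_resolve_aneg_linkmode(), located in drivers/net/phy/phy-core.c: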

/**
 * phy_resolve_aneg_linkmode - resolve the advertisements into PHY settings
 * @phydev: The phy_device struct
 *
 * Resolve our and the link partner advertisements into their corresponding
 * speed and duplex. If full duplex was negotiated, extract the pause mode
 * from the link partner mask.
 */
void phy_resolve_aneg_linkmode(struct phy_device *phydev)
{
        __ETHTOOL_DECLARE_LINK_MODE_MASK(common);
        int i;

        linkmode_and(common, phydev->lp_advertising, phydev->advertising);

        for (i = 0; i < ARRAY_SIZE(settings); i++)
                if (test_bit(settings[i].bit, common)) {
                        phydev->speed = settings[i].speed;
                        phydev->duplex = settings[i].duplex;
                        break;
                }

        phy_resolve_aneg_pause(phydev);
}
EXPORT_SYMBOL_GPL(phy_resolve_aneg_linkmode);

The function computes the highest speed and the duplex mode supported by both PHYs from the values of phydev->lp_advertising and phydev->advertising.
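
The same "AND the two masks, then take the first match in a table sorted from fastest to slowest" idea can be sketched outside the kernel. Everything below is a simplified stand-in: the 32-bit masks, the enum mode_bit positions, and the names mode_settings and resolve_aneg() are hypothetical and replace the kernel's linkmode bitmaps and settings[] table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions for this sketch only. */
enum mode_bit {
	MODE_10_HALF, MODE_10_FULL,
	MODE_100_HALF, MODE_100_FULL,
	MODE_1000_HALF, MODE_1000_FULL
};

struct mode_setting {
	int speed;       /* Mb/s */
	int full_duplex; /* 1 = full, 0 = half */
	enum mode_bit bit;
};

/* Sorted from the most to the least preferred mode. */
static const struct mode_setting mode_settings[] = {
	{ 1000, 1, MODE_1000_FULL }, { 1000, 0, MODE_1000_HALF },
	{ 100, 1, MODE_100_FULL },   { 100, 0, MODE_100_HALF },
	{ 10, 1, MODE_10_FULL },     { 10, 0, MODE_10_HALF },
};

struct link_result {
	int speed;       /* -1 if no common mode */
	int full_duplex; /* -1 if no common mode */
};

static struct link_result resolve_aneg(uint32_t advertising,
				       uint32_t lp_advertising)
{
	struct link_result res = { -1, -1 };
	uint32_t common = advertising & lp_advertising;
	size_t i;

	for (i = 0; i < sizeof(mode_settings) / sizeof(mode_settings[0]); i++) {
		if (common & (UINT32_C(1) << mode_settings[i].bit)) {
			res.speed = mode_settings[i].speed;
			res.full_duplex = mode_settings[i].full_duplex;
			break;
		}
	}
	return res;
}
```

If both masks are 0x3f (all six modes), the result is 1000 Mb/s full duplex. The result depends only on the two software masks, so a cable that cannot carry gigabit does not change it unless one of the masks changes.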

genphy_read_lpa() above showed where phydev->lp_advertising comes from. Where does the value of phydev->advertising come from?

Tracing through the code eventually leads to phy_probe():

/**
 * phy_probe - probe and init a PHY device
 * @dev: device to probe and init
 *
 * Description: Take care of setting up the phy_device structure,
 *   set the state to READY (the driver's init function should
 *   set it to STARTING if needed).
 */
static int phy_probe(struct device *dev)
{
        struct phy_device *phydev = to_phy_device(dev);
        struct device_driver *drv = phydev->mdio.dev.driver;
        struct phy_driver *phydrv = to_phy_driver(drv);
        int err = 0;

        phydev->drv = phydrv;

        /* Disable the interrupt if the PHY doesn't support it
         * but the interrupt is still a valid one
         */
         if (!phy_drv_supports_irq(phydrv) && phy_interrupt_is_valid(phydev))
                phydev->irq = PHY_POLL;

        if (phydrv->flags & PHY_IS_INTERNAL)
                phydev->is_internal = true;

        mutex_lock(&phydev->lock);

        /* Deassert the reset signal */
        phy_device_reset(phydev, 0);

        if (phydev->drv->probe) {
                err = phydev->drv->probe(phydev);
                if (err)
                        goto out;
        }

        /* Start out supporting everything. Eventually,
         * a controller will attach, and may modify one
         * or both of these values
         */
        if (phydrv->features) {
                linkmode_copy(phydev->supported, phydrv->features);
        } else if (phydrv->get_features) {
                err = phydrv->get_features(phydev);
        } else if (phydev->is_c45) {
                err = genphy_c45_pma_read_abilities(phydev);
        } else {
                err = genphy_read_abilities(phydev);
        }

        if (err)
                goto out;

        if (!linkmode_test_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
                               phydev->supported))
                phydev->autoneg = 0;

        if (linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
                              phydev->supported))
                phydev->is_gigabit_capable = 1;
        if (linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
                              phydev->supported))
                phydev->is_gigabit_capable = 1;

        of_set_phy_supported(phydev);
        phy_advertise_supported(phydev);

        /* Get the EEE modes we want to prohibit. We will ask
         * the PHY stop advertising these mode later on
         */
        of_set_phy_eee_broken(phydev);

        /* The Pause Frame bits indicate that the PHY can support passing
         * pause frames. During autonegotiation, the PHYs will determine if
         * they should allow pause frames to pass.  The MAC driver should then
         * use that result to determine whether to enable flow control via
         * pause frames.
         *
         * Normally, PHY drivers should not set the Pause bits, and instead
         * allow phylib to do that.  However, there may be some situations
         * (e.g. hardware erratum) where the driver wants to set only one
         * of these bits.
         */
        if (!test_bit(ETHTOOL_LINK_MODE_Pause_BIT, phydev->supported) &&
            !test_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT, phydev->supported)) {
                linkmode_set_bit(ETHTOOL_LINK_MODE_Pause_BIT,
                                 phydev->supported);
                linkmode_set_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT,
                                 phydev->supported);
        }

        /* Set the state to READY by default */
        phydev->state = PHY_READY;

out:
        /* Assert the reset signal */
        if (err)
                phy_device_reset(phydev, 1);

        mutex_unlock(&phydev->lock);

        return err;
}

At line 41 (counting from the function signature), genphy_read_abilities() obtains the abilities that the PHY supports. It is implemented as follows:

/**
 * genphy_read_abilities - read PHY abilities from Clause 22 registers
 * @phydev: target phy_device struct
 *
 * Description: Reads the PHY's abilities and populates
 * phydev->supported accordingly.
 *
 * Returns: 0 on success, < 0 on failure
 */
int genphy_read_abilities(struct phy_device *phydev)
{
        int val;

        linkmode_set_bit_array(phy_basic_ports_array,
                               ARRAY_SIZE(phy_basic_ports_array),
                               phydev->supported);

        val = phy_read(phydev, MII_BMSR);
        if (val < 0)
                return val;

        linkmode_mod_bit(ETHTOOL_LINK_MODE_Autoneg_BIT, phydev->supported,
                         val & BMSR_ANEGCAPABLE);

        linkmode_mod_bit(ETHTOOL_LINK_MODE_100baseT_Full_BIT, phydev->supported,
                         val & BMSR_100FULL);
        linkmode_mod_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT, phydev->supported,
                         val & BMSR_100HALF);
        linkmode_mod_bit(ETHTOOL_LINK_MODE_10baseT_Full_BIT, phydev->supported,
                         val & BMSR_10FULL);
        linkmode_mod_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT, phydev->supported,
                         val & BMSR_10HALF);

        if (val & BMSR_ESTATEN) {
                val = phy_read(phydev, MII_ESTATUS);
                if (val < 0)
                        return val;

                linkmode_mod_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
                                 phydev->supported, val & ESTATUS_1000_TFULL);
                linkmode_mod_bit(ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
                                 phydev->supported, val & ESTATUS_1000_THALF);
                linkmode_mod_bit(ETHTOOL_LINK_MODE_1000baseX_Full_BIT,
                                 phydev->supported, val & ESTATUS_1000_XFULL);
        }

        return 0;
}
EXPORT_SYMBOL(genphy_read_abilities);

It first reads the PHY's MII_BMSR register to get the supported 10/100 abilities and sets the corresponding bits in phydev->supported.

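Then, at line 25 (counting from the function signature), if (val & BMSR_ESTATEN) tests the BMSR_ESTATEN bit of the MII_BMSR register, that is, bit 8 of BMSR (Basic Mode Status Register, Address 0x01), to see whether the PHY has the extended status register in which gigabit abilities are reported.

If it does, line 26 reads the MII_ESTATUS register for the specific gigabit abilities and sets the corresponding bits in phydev->supported.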
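
The two-step decoding (BMSR first, then ESTATUS only when BMSR_ESTATEN is set) can be illustrated with a small user-space sketch. The register bit values are those from mii.h; the CAP_* flags and decode_abilities() are hypothetical names, and the register values used in the example are illustrative rather than read from real hardware.

```c
#include <stdint.h>

/* Bit definitions from include/uapi/linux/mii.h. */
#define BMSR_ANEGCAPABLE   0x0008
#define BMSR_ESTATEN       0x0100
#define BMSR_10HALF        0x0800
#define BMSR_10FULL        0x1000
#define BMSR_100HALF       0x2000
#define BMSR_100FULL       0x4000
#define ESTATUS_1000_THALF 0x1000
#define ESTATUS_1000_TFULL 0x2000

/* Hypothetical capability flags for this sketch only. */
enum {
	CAP_AUTONEG   = 1 << 0,
	CAP_10_HALF   = 1 << 1,
	CAP_10_FULL   = 1 << 2,
	CAP_100_HALF  = 1 << 3,
	CAP_100_FULL  = 1 << 4,
	CAP_1000_HALF = 1 << 5,
	CAP_1000_FULL = 1 << 6
};

/* estatus is ignored unless BMSR says the extended status register exists. */
static unsigned int decode_abilities(uint16_t bmsr, uint16_t estatus)
{
	unsigned int caps = 0;

	if (bmsr & BMSR_ANEGCAPABLE) caps |= CAP_AUTONEG;
	if (bmsr & BMSR_10HALF)      caps |= CAP_10_HALF;
	if (bmsr & BMSR_10FULL)      caps |= CAP_10_FULL;
	if (bmsr & BMSR_100HALF)     caps |= CAP_100_HALF;
	if (bmsr & BMSR_100FULL)     caps |= CAP_100_FULL;

	if (bmsr & BMSR_ESTATEN) {
		if (estatus & ESTATUS_1000_THALF) caps |= CAP_1000_HALF;
		if (estatus & ESTATUS_1000_TFULL) caps |= CAP_1000_FULL;
	}
	return caps;
}
```

For example, decode_abilities(0x7949, 0x2000) includes CAP_1000_FULL but not CAP_1000_HALF, while decode_abilities(0x7849, 0x2000) reports no gigabit modes because BMSR_ESTATEN is clear.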

After genphy_read_abilities() finishes, execution returns to phy_probe().

Line 59 of phy_probe() calls phy_advertise_supported():

/**
 * phy_advertise_supported - Advertise all supported modes
 * @phydev: target phy_device struct
 *
 * Description: Called to advertise all supported modes, doesn't touch
 * pause mode advertising.
 */
void phy_advertise_supported(struct phy_device *phydev)
{
        __ETHTOOL_DECLARE_LINK_MODE_MASK(new);

        linkmode_copy(new, phydev->supported);
        phy_copy_pause_bits(new, phydev->advertising);
        linkmode_copy(phydev->advertising, new);
}
EXPORT_SYMBOL(phy_advertise_supported);

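Here the modes in phydev->supported are copied straight into phydev->advertising with linkmode_copy (the pause bits already present in phydev->advertising are preserved). In other words, phydev->advertising simply declares every ability the PHY supports.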
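
As a rough illustration of that copy, the sketch below models the link-mode bitmaps as 32-bit masks. The bit positions and the name advertise_supported() are hypothetical; only the behaviour (take everything from supported, keep the pause bits from the old advertising value) mirrors the function above.

```c
#include <stdint.h>

/* Hypothetical bit positions for this sketch only. */
#define MODE_100_FULL   (UINT32_C(1) << 3)
#define MODE_1000_FULL  (UINT32_C(1) << 5)
#define MODE_PAUSE      (UINT32_C(1) << 13)
#define MODE_ASYM_PAUSE (UINT32_C(1) << 14)
#define PAUSE_MASK      (MODE_PAUSE | MODE_ASYM_PAUSE)

/*
 * Returns the new advertising mask: all supported modes, with the
 * pause bits taken from the previous advertising mask instead.
 */
static uint32_t advertise_supported(uint32_t supported, uint32_t advertising)
{
	uint32_t new_adv = supported;

	new_adv = (new_adv & ~PAUSE_MASK) | (advertising & PAUSE_MASK);
	return new_adv;
}
```

Note that nothing in this step consults the cable or the PHY's current state: if 1000baseT/Full is in supported, it ends up in advertising.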

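Returning to phy_resolve_aneg_linkmode(), it is now clear why the Linux 5.10 kernel ends up with gigabit full duplex.


Later tests on the OK3576 showed that the problem is already fixed on Linux 6.1. On Linux 6.1, ethtool reports the following for the interface: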

Settings for eth0:
        Supported ports: [ TP    MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: Symmetric Receive-only
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                             100baseT/Half 100baseT/Full
                                             1000baseT/Full
        Link partner advertised pause frame use: Symmetric Receive-only
        Link partner advertised auto-negotiation: Yes
        Link partner advertised FEC modes: Not reported
        Speed: 100Mb/s
        Duplex: Full
        Auto-negotiation: on
        master-slave cfg: preferred slave
        master-slave status: slave
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: external
        MDI-X: Unknown
        Supports Wake-on: ug
        Wake-on: d
        Current message level: 0x0000003f (63)
                               drv probe link timer ifdown ifup
        Link detected: yes

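The output is essentially the same as on Linux 4.19: the kernel believes that both sides are gigabit-capable, yet it does not use gigabit speed.

At first I assumed that Linux 6.1 had noticed and fixed the Linux 5.10 problem, but searching Linux 6.1 with the same method turned up nothing resembling the Linux 4.19 logic.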

Moreover, comparing the generic PHY driver files of Linux 6.1 and Linux 5.10 showed them to be almost identical, with no significant changes.

However, on Linux 6.1 the kernel prints the following when the link comes up:

[   14.611725] RTL8211F Gigabit Ethernet stmmac-0:01: Downshift occurred from negotiated speed 1Gbps to actual speed 100Mbps, check cabling!
[   14.613638] rk_gmac-dwmac 2a220000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[   14.613669] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

Following this message leads to phy_check_downshift() in drivers/net/phy/phy-core.c:

/**
 * phy_check_downshift - check whether downshift occurred
 * @phydev: The phy_device struct
 *
 * Check whether a downshift to a lower speed occurred. If this should be the
 * case warn the user.
 * Prerequisite for detecting downshift is that PHY driver implements the
 * read_status callback and sets phydev->speed to the actual link speed.
 */
void phy_check_downshift(struct phy_device *phydev)
{
        __ETHTOOL_DECLARE_LINK_MODE_MASK(common);
        int i, speed = SPEED_UNKNOWN;

        phydev->downshifted_rate = 0;

        if (phydev->autoneg == AUTONEG_DISABLE ||
            phydev->speed == SPEED_UNKNOWN)
                return;

        linkmode_and(common, phydev->lp_advertising, phydev->advertising);

        for (i = 0; i < ARRAY_SIZE(settings); i++)
                if (test_bit(settings[i].bit, common)) {
                        speed = settings[i].speed;
                        break;
                }

        if (speed == SPEED_UNKNOWN || phydev->speed >= speed)
                return;

        phydev_warn(phydev, "Downshift occurred from negotiated speed %s to actual speed %s, check cabling!\n",
                    phy_speed_to_str(speed), phy_speed_to_str(phydev->speed));

        phydev->downshifted_rate = 1;
}
EXPORT_SYMBOL_GPL(phy_check_downshift);

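The function ANDs the link partner's abilities recorded in phydev->lp_advertising with the local abilities recorded in phydev->advertising, walks the settings table, and stores the highest speed both sides support (the speed that should have been negotiated) in the speed variable.

It then compares the actual phydev->speed with that speed. If phydev->speed is lower, the lower phydev->speed stays in effect, phydev->downshifted_rate is set, and the warning is printed: Downshift occurred from negotiated speed 1Gbps to actual speed 100Mbps, check cabling!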
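
The comparison itself is small enough to sketch in isolation. In the following user-space version, downshift_detected() is a hypothetical name, speeds are plain integers in Mb/s, and SPEED_UNKNOWN is defined locally as -1 for the sketch; the negotiated speed is assumed to have been computed from the two advertisement masks beforehand.

```c
#define SPEED_UNKNOWN (-1)

/*
 * Returns 1 if the link runs slower than what both sides advertised,
 * 0 otherwise. Mirrors the early-return structure described above.
 */
static int downshift_detected(int autoneg_enabled, int negotiated_speed,
			      int actual_speed)
{
	/* Without autoneg or a known actual speed there is nothing to compare. */
	if (!autoneg_enabled || actual_speed == SPEED_UNKNOWN)
		return 0;

	/* No common mode, or the link runs at least as fast as negotiated. */
	if (negotiated_speed == SPEED_UNKNOWN || actual_speed >= negotiated_speed)
		return 0;

	return 1;
}
```

downshift_detected(1, 1000, 100) returns 1, which corresponds to the warning quoted above. The check can only fire if something has already set the actual speed to a value lower than the negotiated one.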

The next question: Linux 5.10 has this function too, so why is it not triggered there? Put differently, the earlier analysis showed that genphy_read_status() derives gigabit for phydev->speed from phydev->lp_advertising and phydev->advertising, so why is phydev->speed 100 Mb/s by the time phy_check_downshift() runs?

Next, look at where phy_check_downshift() is called:

/**
 * phy_check_link_status - check link status and set state accordingly
 * @phydev: the phy_device struct
 *
 * Description: Check for link and whether autoneg was triggered / is running
 * and set state accordingly
 */
static int phy_check_link_status(struct phy_device *phydev)
{
        int err;

        WARN_ON(!mutex_is_locked(&phydev->lock));

        /* Keep previous state if loopback is enabled because some PHYs
         * report that Link is Down when loopback is enabled.
         */
        if (phydev->loopback_enabled)
                return 0;

        err = phy_read_status(phydev);
        if (err)
                return err;

        if (phydev->link && phydev->state != PHY_RUNNING) {
                phy_check_downshift(phydev);
                phydev->state = PHY_RUNNING;
                phy_link_up(phydev);
        } else if (!phydev->link && phydev->state != PHY_NOLINK) {
                phydev->state = PHY_NOLINK;
                phy_link_down(phydev);
        }

        return 0;
}

At line 20 (counting from the top of the listing), this function calls phy_read_status():

static inline int phy_read_status(struct phy_device *phydev)
{
        if (!phydev->drv)
                return -EIO;

        if (phydev->drv->read_status)
                return phydev->drv->read_status(phydev);
        else
                return genphy_read_status(phydev);
}

phy_read_status() ends up running genphy_read_status(). A print added at the end of genphy_read_status() shows that phydev->speed is still 1000, so why has it become 100 by the time phy_check_downshift() is called at line 25?

Adding a print inside phy_read_status() shows that it actually takes the first branch and calls phydev->drv->read_status(). Then how does genphy_read_status() still get called?

Adding dump_stack(); to genphy_read_status() makes the kernel print the call chain:

[ 3675.821941] Call trace:
[ 3675.821952]  dump_backtrace+0xdc/0x130
[ 3675.821971]  show_stack+0x1c/0x30
[ 3675.821992]  dump_stack_lvl+0x64/0x7c
[ 3675.822009]  dump_stack+0x14/0x2c
[ 3675.822028]  genphy_read_status+0x24/0x174
[ 3675.822047]  rtlgen_read_status+0x1c/0x4c
[ 3675.822069]  phy_read_status+0x60/0x8c
[ 3675.822086]  phy_check_link_status+0x84/0x150
[ 3675.822104]  phy_state_machine+0x26c/0x280
[ 3675.822125]  process_one_work+0x1e8/0x454
[ 3675.822145]  worker_thread+0x174/0x52c
[ 3675.822166]  kthread+0xdc/0xe0
[ 3675.822185]  ret_from_fork+0x10/0x20

That explains it: phy_read_status() calls rtlgen_read_status(), which in turn calls genphy_read_status(). The next step is to find rtlgen_read_status().

It is in drivers/net/phy/realtek.c:

static int rtlgen_read_status(struct phy_device *phydev)
{
        int ret;

        ret = genphy_read_status(phydev);
        if (ret < 0)
                return ret;

        return rtlgen_get_speed(phydev);
}

After calling genphy_read_status(), rtlgen_read_status() returns the result of rtlgen_get_speed(), so that function is next:

/* get actual speed to cover the downshift case */
static int rtlgen_get_speed(struct phy_device *phydev)
{
        int val;

        if (!phydev->link)
                return 0;

        val = phy_read_paged(phydev, 0xa43, 0x12);
        if (val < 0)
                return val;

        switch (val & RTLGEN_SPEED_MASK) {
        case 0x0000:
                phydev->speed = SPEED_10;
                break;
        case 0x0010:
                phydev->speed = SPEED_100;
                break;
        case 0x0020:
                phydev->speed = SPEED_1000;
                break;
        case 0x0200:
                phydev->speed = SPEED_10000;
                break;
        case 0x0210:
                phydev->speed = SPEED_2500;
                break;
        case 0x0220:
                phydev->speed = SPEED_5000;
                break;
        default:
                break;
        }

        return 0;
}

Now everything is clear: this is where phydev->speed is overwritten with the speed the PHY actually reports, and the whole picture adds up.
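
The decoding in rtlgen_get_speed() can be exercised on its own with a raw register value. This is a sketch under one assumption: the listing does not show the definition of RTLGEN_SPEED_MASK, and the value 0x0630 used here is what mainline realtek.c defines to the best of my knowledge, so check it against your tree. rtlgen_decode_speed() is a hypothetical helper, with speeds as plain integers in Mb/s.

```c
#include <stdint.h>

/* Assumed value; verify against drivers/net/phy/realtek.c in your tree. */
#define RTLGEN_SPEED_MASK 0x0630

/*
 * val: raw value of register 0x12 on page 0xa43.
 * current_speed: returned unchanged for field values the switch
 * does not recognise, like the default branch in the driver.
 */
static int rtlgen_decode_speed(uint16_t val, int current_speed)
{
	switch (val & RTLGEN_SPEED_MASK) {
	case 0x0000: return 10;
	case 0x0010: return 100;
	case 0x0020: return 1000;
	case 0x0200: return 10000;
	case 0x0210: return 2500;
	case 0x0220: return 5000;
	default:     return current_speed;
	}
}
```

For instance, rtlgen_decode_speed(0x0018, 1000) returns 100: the speed field reads 0x0010 regardless of the unrelated bit 3, and the previously resolved 1000 is replaced.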

One question remains: why does Linux 6.1 work while Linux 5.10 does not?

Adding a print to phy_read_status() in the Linux 5.10 kernel shows that it does not call phydev->drv->read_status() there and goes straight to genphy_read_status().

Adding dump_stack(); to genphy_read_status() and printing the call chain confirms this:

[ 9437.356211] Call trace:
[ 9437.356219]  dump_backtrace+0x0/0x1b0
[ 9437.356226]  show_stack+0x20/0x2c
[ 9437.356233]  dump_stack_lvl+0xc8/0xf8
[ 9437.356240]  dump_stack+0x18/0x34
[ 9437.356247]  genphy_read_status+0x28/0x1d4
[ 9437.356261]  phy_read_status+0x64/0x94
[ 9437.356267]  phy_check_link_status+0xb4/0x15c
[ 9437.356274]  phy_state_machine+0x190/0x264
[ 9437.356281]  process_one_work+0x1e0/0x298
[ 9437.356288]  worker_thread+0x1e0/0x278
[ 9437.356295]  kthread+0xf4/0x104
[ 9437.356302]  ret_from_fork+0x10/0x30

So why does phy_read_status() not reach rtlgen_read_status() in drivers/net/phy/realtek.c? Reading the code shows that Linux 5.10 is missing one line:

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index e3c77485f9ae..cb64474ce8b2 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -693,6 +693,7 @@ static struct phy_driver realtek_drvs[] = {
                .config_init    = &rtl8211f_config_init,
                .ack_interrupt  = &rtl8211f_ack_interrupt,
                .config_intr    = &rtl8211f_config_intr,
+               .read_status    = rtlgen_read_status,
                .suspend        = genphy_suspend,
                .resume         = rtl821x_resume,
                .read_page      = rtl821x_read_page,

With this line added, everything works.
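
Why a single missing callback changes the outcome can be shown with a stripped-down model of the dispatch. All types and names below (struct phy_dev, struct phy_drv, vendor_read_status(), speed_after_read_status()) are hypothetical stand-ins for the kernel structures; the model assumes autonegotiation resolved to 1000 Mb/s while the hardware actually runs at 100 Mb/s.

```c
#include <errno.h>
#include <stddef.h>

struct phy_dev;

struct phy_drv {
	int (*read_status)(struct phy_dev *dev); /* optional callback */
};

struct phy_dev {
	const struct phy_drv *drv;
	int resolved_speed; /* what the advertisement masks resolve to */
	int hw_speed;       /* what the PHY hardware actually reports */
	int speed;          /* what the rest of the stack will see */
};

/* Generic path: trusts the resolved advertisement. */
static int generic_read_status(struct phy_dev *dev)
{
	dev->speed = dev->resolved_speed;
	return 0;
}

/* Vendor path: runs the generic code, then overrides the speed. */
static int vendor_read_status(struct phy_dev *dev)
{
	int ret = generic_read_status(dev);

	if (ret < 0)
		return ret;
	dev->speed = dev->hw_speed;
	return 0;
}

/* Dispatch: use the driver callback if present, else fall back. */
static int read_status(struct phy_dev *dev)
{
	if (!dev->drv)
		return -EIO;
	if (dev->drv->read_status)
		return dev->drv->read_status(dev);
	return generic_read_status(dev);
}

/* Demo helper: returns the reported speed with or without the callback. */
static int speed_after_read_status(int with_callback)
{
	struct phy_drv drv = { with_callback ? vendor_read_status : NULL };
	struct phy_dev dev = { &drv, 1000, 100, -1 };

	if (read_status(&dev) < 0)
		return -1;
	return dev.speed;
}
```

speed_after_read_status(1) returns 100 and speed_after_read_status(0) returns 1000, which matches the difference observed between the kernel with the .read_status entry and the one without it.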

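That concludes this partial analysis of the PHY driver.

Finally, here is the call hierarchy:

phy_state_machine()
    phy_check_link_status()
        phy_read_status()
            rtlgen_read_status() // Realtek PHY driver: read the PHY status
                genphy_read_status() // call the generic driver to read the PHY status
                    genphy_update_link() // update the link state
                    genphy_read_master_slave() // for gigabit ports, update the local and link partner master/slave state
                    genphy_read_lpa() // update the abilities advertised by the link partner
                    phy_resolve_aneg_linkmode() // autoneg mode: resolve the link result
                rtlgen_get_speed() // Realtek PHY driver: set the actual link speed
        phy_check_downshift() // check whether a downshift occurred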

For more detail on PHY drivers, see the following article:

https://cloud.tencent.com/developer/article/2355036